Preface


Genetics as the formal study of inheritance was 
founded as a field following the rediscovery of 
Mendel’s work at the beginning of this century. This 
led to the first revolution in our understanding of 
inheritance, namely of the basic mechanisms of gene 
transmission, of linkage and of interpretations in 
terms of the behaviour of chromosomes in meiosis. 
The second revolution came with the discovery of 
the Watson-Crick structure of DNA just over 40
years ago, which spelled out the chemical basis for 
the gene, and then its mode of action. Now, 
following the development of recombinant DNA 
technology and many other techniques that enable 
us to clone and sequence DNA with enormous speed 
and efficiency, we are entering a third revolutionary 
phase of genetic analysis as we approach the end of 
the century. Now is the time when whole genomes 
are being sequenced and the complete language of 
organisms is being deciphered. 

It was just over 15 years ago that the potential for 
the complete analysis of the human genome began 
to be appreciated; it came to be realized that this 
would provide enormous power for the analysis of 
all normal biological functions, as well as for the 
analysis of the basis of essentially all human disease. 
Thus developed the Human Genome Project, and 
alongside it many other genome projects. 

The rate of advance of the technology and the 
acquisition of new data could not, I believe, have 
been predicted even by the wildest speculator. In
1986, I suggested that the project to catalogue and 
sequence all human genes and place them in their 
positions along the chromosomes be billed as 
‘Project 2000’. That prediction we can now see will 
soon be realized. 

Almost daily, new genes are discovered, while 
many exist and are waiting to be discovered in the 
databanks of genomic and, especially, partial cDNA 
sequences. The production and analysis of this 
extraordinary accumulation of information requires 
a wide variety of complex techniques, from approaches
to the statistical problems of the analysis
of complex human pedigrees, to the determination 
of DNA sequences. This Handbook provides an 
invaluable guide to the wide range of these tech- 
niques and is practical and usable. It has required 
an enormous effort on the part of the authors 
and, especially, the editors, to put together this 
most valuable companion and all ought to be 
congratulated on the achievement. 

Only 5 years ago when we were organizing a new 
form of international Human Gene Mapping Work- 
shop in London, it was hard to convince the pharm- 
aceutical industry that they should be interested. 
Now, not only is there a huge and burgeoning 
biotechnology industry, but no major pharma- 
ceutical company can afford any longer not to invest 
in a major way in genome analysis, and many are 
accepting that this is where their future lies. The
opportunities are enormous but the challenges now 
are to work with the genes and to understand their 
functions, and that may take perhaps another 
century or more to achieve. I am sure that this
Handbook will make an important contribution
towards that end.


Walter Bodmer 
ICRF, Laboratory Head


Introduction 


The ICRF Handbook of Genome Analysis is a 
combination of protocol manual and informational 
resource, drawing on the expertise of researchers at 
ICRF and elsewhere. It describes and evaluates a 
wide range of techniques pertinent to genome 
analysis. The first volume comprises a description 
and evaluation of strategies, techniques and proto- 
cols for use in the genetic and physical mapping of 
the human genome (Chapters 1-19). Genome 
analysis techniques are also used widely in the 
study and diagnosis of cancers and other diseases, 
and some of these applications are also covered. A 
glossary of abbreviations and acronyms is included 
at the end of Volume 2. 

The second volume includes a comprehensive 
review section of approaches to DNA sequencing 
(Chapters 20-25) and reviews of progress in the 
analysis of the genomes of important model systems 
(Chapters 26-34). Organisms covered include the 
mouse, Drosophila, Caenorhabditis elegans,
Saccharomyces cerevisiae (the first eukaryote organism
to have its genome fully sequenced), Escherichia coli,
Arabidopsis thaliana and rice. The second volume 
concludes with chapters on information resources 
and how to access them (Chapters 35-37) and 
appendices covering materials, preparation of blood 
samples, suppliers and other useful addresses, 
extensive tables of mapped human disease genes 
and mouse knockouts, and tables of chromosomal 
aberrations associated with cancer. An index to the 
complete handbook is included at the end of each 
volume. 

One of the main driving forces behind the effort 
to map and sequence the human genome is the 
isolation and characterization of human disease 
genes. The figure on the following page shows the 
typical stages in such an enterprise and the relevant 
chapters in the Handbook that deal with the tech- 
niques involved. 


Nigel K. Spurr 
Stephan Beck


Introduction 


The Sanger Centre, Hinxton, Cambridge CB10 1SA, UK 


The genotype of all living organisms is represented 
by nucleic acids in the form of either DNA (deoxyri- 
bonucleic acid) or RNA (ribonucleic acid). There- 
fore, it is not surprising that DNA/RNA sequence 
analysis is so fundamental to genome analysis and 
the understanding of biological processes in general. 
The technical breakthrough for DNA sequencing 
came in 1977 when Maxam and Gilbert described a 
method for sequencing by base-specific chemical 
degradation [1] and Sanger and coworkers describ- 
ed a method for enzymatic sequencing using chain- 
terminating inhibitors [2]. Although the under- 
lying concept of both methods has not changed, 
many different strategies, modifications and speci- 
alized protocols have been developed over the 
years. These improvements have made large-scale 
sequencing possible and genome sequencing pro- 
jects feasible. The largest genome sequenced to 
date is that of the yeast, Saccharomyces cerevisiae 
(15 Mb) [4], while the longest contiguous sequence is 
from the nematode Caenorhabditis elegans (18 Mb) [5]. 
In addition, over 8000 of the estimated 100000 
human genes have already been sequenced [5], 
among them many disease genes. 

The main aim of this section of the Handbook is to 
help investigators select the best strategy for a 
particular project by providing critical reviews as 
well as protocols of the currently available
sequencing methods, and by sharing personal
experience and tips. As illustrated in Fig. 1.4.1, 
different codons can code for the same amino acid. 
By analogy, different methods can be used to 
sequence a particular DNA fragment. In some cases, 
the method of choice is just a matter of personal 
preference, whereas in others, potential problems 
can be avoided if the right questions are considered 
early on. Whatever the size and type of the se- 
quencing project, it is well worth spending a little 
time thinking about the best strategy. A typical 
sequencing project is subdivided into multiple, indi- 
vidual steps such as cloning, template preparation, 
labelling, sequencing, electrophoresis, detection, 
band calling, editing and analysis. The subsequent 
chapters (Chapters 20-25) discuss various aspects 
and options for the stages in DNA sequencing, 
starting with the most basic choices, such as random 
versus ordered sequencing, in vivo versus in vitro 
template amplification, enzymatic versus chemical 
sequencing, radioactive versus nonradioactive 
detection. 

Factors that may affect the choice of a particular 
strategy are cost, convenience, accuracy and the 
complexity of the DNA to be sequenced. Simple 
clone identification or generation of expressed 
sequence tags (ESTs), for instance, may not require 
the same accuracy as the generation of novel 
sequence (where sequencing of both strands is
imperative) and a high GC or repeat content of the 
target DNA can restrict the choice of applicable 
strategies. The general choice of sequencing 
strategies is covered in Chapter 20 (C. Churcher 
et al.), which discusses the advantages and 
disadvantages in different situations of strategies 
such as shotgun sequencing, primer-walking, 
transposon-mediated sequencing and the use of 
nested deletions. For high-quality and efficient DNA 
sequencing, all steps of a chosen strategy not only 
have to work individually but also have to be 
compatible with each other. A good example in this 
context is the step of template preparation. It is well 
known that the ability of DNA templates to se- 
quence well cannot be reliably predicted from 
agarose gels or optical density measurements, 
although both methods can be quite helpful in 
finding out why certain templates do not sequence 
well. In addition, different strategies have different 
requirements, and therefore the appropriate tem- 
plate quality is usually determined empirically. A 
comprehensive selection of 18 different protocols for 
template amplification and purification is described 
in Chapter 21 (A. Rosenthal et al.). 

Chapter 22 (P. Heinrich and H. Domdey) discusses 
both chemical degradation and dideoxy sequencing 
chemistries. On the basis of the Science Citation 
Index (1981-1994), dideoxy DNA sequencing 
appears to be the most frequently used technique in 
molecular biology (about 33 500 citations in ref. 2 
compared with about 3500 citations for the 
polymerase chain reaction (PCR) of ref. 7). This 
enormous success is partly due to the excellent 
commercial support for DNA sequencing technology
and its increasing automation. Chapter 22
concludes with a brief look at some potential future 
sequencing technologies [8], such as sequencing 
by hybridization, by mass spectrometry and by 
high-resolution microscopy. Chapter 23 (C. Heller) 
discusses the principles and practice of slab gel 
electrophoresis and provides some basic protocols. 
In Chapter 24, P. Richterich discusses the advantages 
and disadvantages of various methods of sequence 
labelling and detection, including enzyme-linked 
detection methods. In the final chapter of this section, 
A. Milosavljevic (Chapter 25) describes the use of 
the program PYTHIA to detect and characterize 
repetitive sequences in DNA. Many programs for 
sequence management and sequence analysis are 
freely available via WWW, anonymous ftp or e-mail 
from resource centres such as EMBL, NCBI and the 
UK-HGMP (see Appendix V and Chapters 35 and 37 
for addresses and guidance on access). 
Sequencing strategies


Carol Churcher, Mary Berks, Sharen Bowman, 
David Buck & Karen Thomas 


The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK 


This chapter is dedicated to the memory of Dr Mary Berks who died on 12 May 1996.

20.1 Introduction


DNA sequencing has become a widely used tool in 
many molecular biology laboratories. Its numerous 
applications range from sequencing tens or hun- 
dreds of bases to verify a cloning step, through a few 
kilobases for many gene-sized projects, to kilobases 
and megabases for whole-genome projects. The 
choice of sequencing strategy to be employed can be 
based to some extent on project size but there are 
other considerations such as finances, available 
equipment, and expertise within the laboratory. 
There are two groups of methods which can be used 
for the generation of DNA fragments for sequen- 
cing: random and ordered. The random methods 
use restriction enzyme digestion or shearing to 
produce fragments. The ordered or direct methods 
commonly used are nested deletions, primer walk- 
ing, and transposon-facilitated sequencing. Each 
system is discussed with regard to its applications, 
advantages and disadvantages. Multiplex sequencing
is presented as a method that can utilize DNA
fragments generated either randomly or by a direct
approach.


20.2 Shotgun sequencing 


The shotgun approach (Fig. 20.1) is a random
method for DNA sequencing. The DNA is randomly 
fragmented, each fragment is sequenced, and these 
‘short reads’ are then re-assembled in order, gener- 
ating the original DNA sequence. With the intro- 
duction of more automated methods for sequence 
data collection, the shotgun approach is gaining 
popularity and is the method of choice for the 
majority of large-scale sequencing projects already 
under way [1-3] (see, for example, Chapter 29). 

The stages needed for a shotgun project are 
described in Sections 20.2.1-20.2.4. 


20.2.1 Library preparation 


Fig. 20.1 Shotgun sequencing. The target DNA is fragmented
using the method of choice, and fragments are size-selected
on an agarose gel. The fragmented DNA is then blunt-ended
using a suitable enzyme such as mung bean nuclease. Vector
DNA is digested with a suitable enzyme and treated with calf
intestinal alkaline phosphatase (CIP) to reduce self-ligation.
Insert and vector are ligated together and then transformed
into E. coli and plated out. Individual plaques are prepped,
sequenced and the reads assembled to form contigs. Finally,
the cosmid is ‘contiguated’ using directed methods such as
primer walking to reconstruct the original cosmid sequence.

The quality of library production is critical to the
success of a shotgun project. It is necessary to
fragment the DNA to be sequenced as randomly as
possible. If the vector DNA comprises a significant 
percentage of the total DNA (for example, a lambda- 
clone) then it is advisable to purify the DNA insert 
before proceeding any further. For a cosmid project 
this is not necessary, as the cosmid vector makes up 
only about 15% of the total DNA, and can provide a 
convenient internal control to check sequencing 
accuracy. Several methods can be used to fragment 
DNA [4-6], but none has been proved to do so in a
completely random manner. Sonication is most 
frequently used, which preferentially breaks DNA in 
A/T-rich regions or close to the ends of a linear DNA 
fragment. It is therefore necessary to end-ligate 
linear DNA before the sonication step to minimize a 
nonrandom distribution of fragments. To ensure 
random breakage the DNA solution must be kept as 
cold as possible (0-4 °C) and sonication restricted to 
short bursts. Calibration of the sonicator is necessary 
to achieve the desired size range of fragments. After 
repairing the damaged ends of the DNA with an 
enzyme such as mung bean nuclease, the DNA is 
run on an agarose gel and fragments in the desired 
size range excised and purified. A fragment range 
of 1.4-2 kb is optimal: it is greater than the
longest achievable read length, is relatively stable in
commonly used sequencing vectors, and gives some 
flexibility for gap closure in the latter stages of a 
shotgun project. Fragments are then blunt-end
ligated into the sequencing vector, usually the
single-stranded phage vector M13, and transformed
into a suitable Escherichia coli host. A test plate of
each library should be generated and scored for the
ratio of insert-containing to insert-free clones. It is
also advisable, prior to large-scale data production, to
sequence some of the clones produced to confirm
library quality and randomness.


20.2.2 Sequencing 


The difficulties presented in the sequencing phase of 
a shotgun project are mainly of scale. A large 
number of sequencing templates of consistently 
high quality must be prepared. Each template is 
treated in exactly the same way, so automation of 
both template preparation and sequencing reactions 
is possible [7]. A vector-specific universal primer is 
used for all templates, eliminating the need for 
costly custom primer synthesis. If 960 sequencing 
templates are processed for a cosmid project, after 
sample losses due to reaction failures or presence of 
cloning (cosmid) or sequencing (M13) vector, an 
average of 700 useful sequences remain. For a 
cosmid with an insert size of 35-40kb, this would 
give a sequence redundancy of five- to sixfold after 
assembly. If possible, sequencing reactions should
be loaded on a system allowing automatic data 
collection and computer entry, as it is necessary to 
process many samples for a single project. 
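
The redundancy figure quoted above can be checked with a short calculation. The following is a minimal sketch, not part of the original protocol: the 700 usable reads and the 35-40 kb insert size come from the text, whereas the mean read length of 300 bp is an assumption chosen for illustration.

```python
def sequence_redundancy(n_reads, mean_read_length_bp, insert_size_bp):
    """Average number of times each base of the insert is read."""
    return n_reads * mean_read_length_bp / insert_size_bp


# 700 usable reads is the figure given in the text; the 300 bp mean read
# length is an assumed value for illustration only.
for insert_kb in (35, 40):
    fold = sequence_redundancy(700, 300, insert_kb * 1000)
    print(f"{insert_kb} kb insert: about {fold:.2f}-fold redundancy")
```

Under these assumptions the result falls between roughly five- and sixfold, in line with the range stated above; a different mean read length would shift the estimate proportionally.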


20.2.3 Assembly and editing 


Sequences generated by a shotgun project are ran- 
dom, and do not normally exceed 500 bp in length. It 
is essential to have a sequence assembly software 
package to process the individual sequences and 
assemble them into contigs [8]. If repetitive stretches 
of DNA are present (see Chapter 25), more stringent 
criteria must be used for data assembly. After a 
suitable number of reads have been entered, then 
individual sequences can be compared and edited to 
remove miscalls and identify sequence-dependent 
problems. 
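
To illustrate what an assembly package does in principle, the sketch below merges reads greedily by their longest suffix-prefix overlap. It is a toy example under strong assumptions (error-free reads, a single strand, no repeats, exact matching) and does not represent the algorithm of any particular assembly program; the function names and the `min_overlap` parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_overlap=4):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Drop exact duplicates and reads wholly contained in another read.
    contigs = list(dict.fromkeys(reads))
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no further merges possible
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Four overlapping reads taken from an invented 31-base target sequence.
target = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = [target[15:27], target[0:12], target[22:31], target[7:20]]
print(greedy_assemble(reads))
```

Raising `min_overlap` is the simplest analogue of the more stringent assembly criteria needed when repetitive DNA is present: short chance overlaps are then no longer accepted as joins.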


20.2.4 Directed sequencing 


It is rare for the sequence data from the shotgun 
phase of a project to assemble into one contig of the 
size expected for the cosmid insert. Usually the data 
are left in a small number of contigs, and a more 
directed approach must be applied in order to fill the 
remaining gaps. If the remaining gaps were due
purely to the random nature of clone
distribution intrinsic to this method, then entering
more shotgun reads would eventually fill the gaps.
However, in practice, gaps seem to occur for other 
reasons, some due to cloning problems in M13 or 
regions difficult to sequence. Gaps can be filled by a 
variety of techniques, including primer walking, or 
using the polymerase chain reaction (PCR) and 
reverse primer to sequence the other end of clones 
lying at the end of contigs, or sequencing a PCR 
product generated from the original cosmid DNA. 
Other sequence-dependent problems will also 
remain, and must be resolved using a directed 
approach. 
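
The contribution of purely random clone distribution to the remaining gaps can be estimated with a simple Poisson model of the kind described by Lander and Waterman. This model is not part of the chapter and is introduced here only as an illustration; it assumes reads start at uniformly random positions and ignores the minimum overlap needed to join two reads.

```python
import math


def fraction_covered(redundancy):
    """Expected fraction of the target covered by at least one read."""
    return 1.0 - math.exp(-redundancy)


def expected_contigs(n_reads, redundancy):
    """Expected number of contigs if reads fall at random positions."""
    return n_reads * math.exp(-redundancy)


# 700 reads is the figure used earlier in the text; the redundancy values
# are illustrative.
for c in (1, 3, 5.5):
    print(f"{c}-fold: {fraction_covered(c):.3f} covered, "
          f"about {expected_contigs(700, c):.1f} contigs expected")
```

At five- to sixfold redundancy the model predicts that well over 99% of the target is covered and that only a handful of contigs remain, which suggests that gaps persisting beyond this point are more likely to reflect cloning or sequencing difficulties than chance alone.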

The shotgun approach to a sequencing project 
has several advantages over a directed strategy. 
Sequence acquisition is rapid: more than 95% of the 
sequence can be generated rapidly in the initial 
shotgun phase of the project. A single primer is used, 
so oligonucleotide synthesis costs are low. The 
redundancy inherent in this method gives a high 
level of confidence that the final sequence generated 
is correct, as each base is sequenced five or six times 
on average. However, producing redundant data is
not an efficient way of generating a sequence; a
directed approach would be more efficient in this
respect. The library preparation stage of a shotgun
project is vital, and can prove difficult. It is necessary
to have the ability and equipment to process a large
number of samples and automatically enter each 
sequence into a computer for processing. Problems 
can arise during data assembly if the DNA 
sequenced has internal repetitive regions. Even after 
the shotgun phase, a directed approach must be 
applied to completely contiguate the project and 
solve any problems. Nevertheless, the shotgun 
approach to sequencing has proved to be a 
dependable method for the rapid, automatable 
generation of large amounts of sequence. 


20.3 Primer walking 


In contrast to the random shotgun approach de- 
scribed above, primer walking is a totally directed 
strategy (Fig. 20.2). It initially involves the use of a 
custom-made oligonucleotide to prime synthesis 
from a known site on the DNA template. Subsequent 
primers are designed according to the sequence that 
is obtained from successive sequencing reactions 
until the complete sequence has been elucidated. 
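
The iterative nature of primer walking can be sketched as a simple loop over one strand. The following is an illustrative model only: the read length of 400 bp, the primer length of 20 bases and the 50-base setback from the end of each read (to place the next primer in reliable sequence) are all assumed values, and the function name is hypothetical.

```python
def primer_walk(template_length, read_length=400, primer_length=20, setback=50):
    """Return the primer start positions needed to walk along one strand.

    Each read begins just after its primer and extends read_length bases.
    The next primer is placed so that it ends setback bases before the
    end of the sequence known so far.
    """
    if read_length <= setback:
        raise ValueError("read_length must exceed setback to make progress")
    positions = []
    pos = 0          # first primer site, e.g. in flanking vector sequence
    known_end = 0    # how far the sequence has been determined so far
    while known_end < template_length:
        positions.append(pos)
        known_end = min(pos + primer_length + read_length, template_length)
        pos = known_end - setback - primer_length
    return positions


sites = primer_walk(2000)
print(len(sites), "primers for one strand:", sites)
```

Each step advances by the read length minus the setback, so the number of primers grows linearly with the length of the insert; this is why the cost of oligonucleotide synthesis dominates for large targets.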

In theory, such a totally directed approach to 
sequencing represents the most efficient strategy, 
since only a minimal set of sequencing reactions is 
performed and redundancy is kept relatively low. In 
practice, however, there are problems. First, the cost 
of manufacturing even the very small amounts of
oligonucleotide required for each sequencing
reaction is high. In addition, the method relies on 
individual primers annealing to unique sites on the 
DNA template, which presents a problem where 
DNA is repetitive. There is also a problem in the 
variation of sequence quality and read length when 
using custom-made oligonucleotides compared 
with sequencing methods relying on universal 
primers which tend to be more consistent. 

Primer walking is least suitable for large-scale 
sequencing projects where many different primers 
would be required and where problems are most 
likely to arise from repetitive DNA sequences. It 
could, however, be the method of choice if the size of 
cloned DNA to be sequenced is small, especially 
relative to the vector. In such cases, primers can be 
designed (or may be commercially available) to 
known vector sequence flanking the DNA insert, 
and further primers can be synthesized as described 
above to complete the sequence of the insert. There 
is thus no unnecessary sequencing of vector and, 
with a small insert size of around 2-3kb, it should be 
possible to obtain sequence on both strands with as 
few as 10 different oligonucleotides. Another advan- 
tage of primer walking is that there is no problem 
with assembly (a major consideration in random 
shotgun approaches) since sequencing always pro- 
gresses from a known point on the DNA. 

Fig. 20.2 Sequencing by primer walking. Double- or
single-stranded DNA preparations of the target DNA (in this
case a cosmid with insert) are made. The first round of
sequencing can be carried out using universal primers to the
vector sequence. This yields sequence data for the insert
DNA. Custom oligos are then chosen and used to extend the
insert sequence data, typically by about 400 bp. These new
data are used to choose further custom oligos, and the
procedure is repeated until one has ‘walked’ the entire
length of the insert DNA.

Sequencing single- or double-stranded DNA with
(unlabelled) custom-made oligonucleotides can be 
conveniently carried out with fluorescently labelled 
dideoxy-terminators using either T7 polymerase 
(Sequenase) or Taq polymerase in the sequencing 
reactions. The labelled products can then be run 
on ABI 373A automatic gel readers. One major 
advantage of these chemistries is that they are 
efficient at resolving severe compressions and 
eliminating stops, both of which are common 
problems in reactions using fluorescently labelled 
universal primers (as in random shotgun strategies). 
However, as mentioned above, the quality of the 
reads may not be sufficiently consistent. 

An attempt has been made in recent years to 
address one of the major drawbacks of a primer 
walking strategy: the cost of primer synthesis. Two 
independent groups have reported [9,10] that 
strings of three adjacent hexamers, and even pen- 
tamers, could prime DNA sequencing reactions 
uniquely without the need for ligation of the 
adjacent oligonucleotides. In one case, the DNA 
template was saturated with a bacterial single- 
stranded binding (SSB) protein which suppresses 
priming by individual hexamers and most pairs of 
hexamers but stimulates priming by the 3’ hexamer 
of most strings of three or more contiguous 
hexamers. The second case, described by its 
developers as ‘modular primer walking’, utilizes 
pentamers and heptamers without the need for SSB 
protein. Currently, a pentamer-heptamer-heptamer
array has been found to be most successful. In this 
case, libraries are required of both pentamers and 
heptamers. Since the two heptamers in the array 
have two degenerate positions each, the size of the 
heptamer and pentamer libraries is the same, at 512 
sequences each, to cover all possible combinations. 
In the case of the hexamer library described above, 
a minimum set of 4096 hexamers is required. 
Although there has been some success using this 
strategy for sequencing parts of single-stranded 
M13 DNA and double-stranded T7 DNA, it remains 
to be seen whether or not it will be practical in 
large-scale sequencing projects. The drawbacks of 
multiple priming to repetitive regions of DNA and 
possible inconsistent quality of data remain. 

Primer walking has been successfully used in 
conjunction with a primarily random shotgun- 
based approach in sequencing of Caenorhabditis 
elegans [1,11]. In this large-scale sequencing project, 
random reads from M13 templates (derived from 
cosmid DNA) are first carried out to moderate levels 
of redundancy (fivefold) and this is followed by a 
directed primer walking strategy to achieve gap 
closure and to complete double-stranding. 


In summary, the main advantages of primer 
walking are in the sequencing of relatively short 
stretches of DNA (where the vector is larger than the 
DNA insert itself) or as a method of completing 
sequence data following a random shotgun-based 
strategy in large-scale sequencing projects. 


20.4 Transposon-mediated sequencing


Shotgun sequencing has the disadvantage of gener- 
ating high redundancy and is wasteful of reagents 
and computer time. In contrast, primer walking is 
low in redundancy but costly in the requirement for 
large numbers of oligonucleotides, and is problematic
with regard to repetitive sequences. Another
approach takes advantage of transposons for directed
sequencing of DNA; it is low in redundancy
and requires only two universal primers.

Transposons are specialized DNA segments that 
can move randomly to many sites in a DNA 
molecule. For use in sequencing, transposons have 
been engineered to contain the binding sites for the 
universal sequencing primers from M13. They can 
be used in two ways, first as simple mobile universal 
primer-binding sites, and second as a means of 
generating nested deletions. Both random and 
ordered transposon-mediated techniques have been 
developed. The standard mobile site transposon 
approach is better suited to smaller DNA targets 
because of the requirement to map the site of 
insertion, whereas the nested deletion approach is 
good for large cosmids because of the ease of 
mapping of the deletion end points. 


20.4.1 Transposons as mobile priming sites 


The random insertion of a transposon into a target 
DNA does not disrupt the original linkage among 
the component parts and can therefore provide 
access to all positions without recourse to ‘shotgun’ 
subcloning or to extensive primer walking [12,13].
The Tn3 family of transposons, which includes
Tn3 and γδ, displays little sequence specificity for
transposition and have been used in plasmid 
sequencing [12,14,15] as a mobile primer site and for 
the generation of nested deletions. Their presence 
can be selected for by a simple bacterial mating. This 
selection is based on the formation of a cointegrate 
as the initial product of transposition in the donor 
cell, and resolution of this cointegrate after it has 
transferred to the recipient cell (Fig.20.3). The final 
product is a simple insertion of the transposon 
bracketed by a 5-bp direct repeat of target DNA. The
Drosophila genome sequencing project (see Chapter
28, Section 28.3.3.4) has developed a transposon-
facilitated system [16]. The DNA fragment to be
sequenced is subcloned into a minimal plasmid and
sites of γδ transposition mapped by PCR [17] so that a
minimal set of sequencing templates can be rapidly
obtained. Over 1.25 Mb have been sequenced in a
2-year period using this method.

Fig. 20.3 The transposition of γδ and mini-γδ (mγδ).
Selection for insertion into a non-conjugative plasmid by
mobilization and conjugal transfer. (a) Transposition
from donor F factor (for wild-type γδ or mini-γδ, mγδ) to a
non-conjugative plasmid, forming an F:plasmid cointegrate.
(b) Transfer of cointegrate to a recipient plasmid-free cell
via conjugation, followed by resolution to yield a
plasmid containing one copy of mγδ, and the donor F
factor molecule.

One of the E. coli genome sequencing projects (see
Chapter 31) utilizes a complete set of mapped and
overlapping λ-phage clones [18]. The sequencing of
these clones is being carried out using a strategy
employing a Tn5-derived minitransposon developed
by Kasai et al. [19]. The small minitransposon is
necessary for sequencing λ-clones because the size
of typical transposons is close to the maximum
capacity for the phage head. Tn5 is ideal because all
it needs for transposition into λ is a pair of Tn5
inverted repeats (19 bp) and a selectable marker,
such as the suppressor tRNA gene supF (Tn5supF
elements are about 300 bp long and contain only
supF, primer-binding sites and the 19-bp terminal
repeats) (Fig. 20.4). Although it inserts less randomly
than Tn3, it does not require the formation of a
cointegrate intermediate (which would necessitate
the formation of a λ-lysogen). The transposon
Tn5supF is delivered in a single cycle of λ-phage
infection of donor cells, and transposon-containing
phage are selected by culturing on dnaB amber cells [20].
The transposition product is a simple insertion of
Tn5 bracketed by a 9-bp direct repeat of target
DNA.

Fig. 20.4 Transposition of Tn5supF to phage lambda.
Selection for Tn5supF insertion in phage lambda (λ) by
plaque formation on a dnaB amber E. coli strain.

Although transposons can provide mobile bind- 
ing sites for sequencing primers, sequence acquisi- 
tion is random and can be highly repetitive. To 
reduce redundancy the position of insertion must 
first be mapped. This can be achieved by restriction 
digest followed by Southern hybridization or by the
use of PCR [19,21]. 
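
Once insertion sites have been mapped, choosing a low-redundancy set of templates amounts to an interval-covering problem. The sketch below is an illustration rather than a published procedure: it assumes that each insertion yields one read outward from each end of the transposon, that every read extends a fixed 400 bp, and that the mapped positions are exact. The function name and parameters are hypothetical.

```python
def choose_insertions(positions, target_length, read_length=400):
    """Greedily pick the fewest mapped insertions whose reads tile the target.

    Each insertion at position p is assumed to give sequence covering
    p - read_length to p + read_length (one read in each direction).
    """
    intervals = sorted(
        (max(0, p - read_length), min(target_length, p + read_length), p)
        for p in positions
    )
    chosen, covered_to, i = [], 0, 0
    while covered_to < target_length:
        best = None
        # Among insertions whose reads start within the covered region,
        # take the one that extends coverage furthest.
        while i < len(intervals) and intervals[i][0] <= covered_to:
            if best is None or intervals[i][1] > best[1]:
                best = intervals[i]
            i += 1
        if best is None or best[1] <= covered_to:
            raise ValueError(f"no mapped insertion covers position {covered_to}")
        chosen.append(best[2])
        covered_to = best[1]
    return chosen


mapped = [100, 300, 500, 900, 1000, 1500, 1700]
print(choose_insertions(mapped, target_length=2000))
```

In this invented example three of the seven mapped insertions suffice to cover a 2 kb target. In practice some overlap between neighbouring reads would be wanted, which could be modelled by using a shorter effective read length.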


20.4.2 Transposon-generated deletions 


The use of transposons to generate deletions in a 
target DNA was a strategy developed by Ahmed 
using Tn9 [22]. Deletions are generated by intra- 
molecular transposition to new sites within the same 
plasmid. This causes a division of the plasmid with 
only the portion containing the plasmid origin of 
replication being recoverable. The original Tn9 
strategy had the disadvantage of nonrandom 
insertion and the inability to obtain sequence from 
both strands. These problems were addressed by 
Wang et al. [23] in the development of a new γδ
transposon which transposes more randomly and 
allows recovery of deletions extending into a cloned 
fragment in either direction. 

The transposition of γδ is replicative, with one
entire copy of the transposon ending up in each of 
the reciprocal deletion derivatives. Both derivatives 
are made viable by inserting a plasmid origin of 
replication within the transposon ends. Selection for 
deletions in either direction is made possible by the 
incorporation in the transposon vector of
contraselectable marker genes (sacB+, for sucrose
sensitivity, and strA+, for streptomycin sensitivity) just
outside each end of the transposon, and selectable
kan+ (Kan^r) and tet+ (Tet^r) genes between the cloning
site and sacB and strA, respectively. Selection on
sucrose tetracycline medium yields deletions ex- 
tending from one end, while selection on strep- 
tomycin kanamycin medium yields deletions in the 
other direction (Fig.20.5). 

Orientational deletions can be selected, none of
which extends beyond the end of the insert DNA.
After transposition, one end of the transposon
always abuts a deletion end point and can serve as a
‘universal’ primer-binding site. Deletion end points
are mapped by plasmid size, allowing selection of
end points in any region.

Fig. 20.5 Transposon-generated deletions in pDUAL.
(a) pDUAL with cloned fragment of DNA. Selection of
clockwise (b) and counterclockwise (c) deletions.


20.5 Nested deletions


Another directed method involves the generation
of nested deletions by sequential digestion. The
simplest such method begins with the cleavage of
double-stranded DNA at a unique site. The DNA is
then shortened by enzymatic digestion with Bal31 [24]
or exonuclease III, followed by treatment with either
S1 nuclease or exonuclease VII.

Bal31 digests double-stranded linear DNA pro- 
gressively by liberating mononucleotides. The 
drawback of Bal31 is that it will digest both ends of 
the linearized fragment, necessitating the recloning 
of the target sequence. Exonuclease III, on the other 
hand, catalyses the stepwise removal of 5’ mononu- 


cleotides from from double-stranded DNA with a 
protruding 5’ terminus, or blunt end. Ends with 
protruding 3’ termini are untouched by the enzyme 
[25]. Judicial choice of restriction enzymes can lead 
to directed deletions from one end only leaving the 
vector intact and thus avoiding the necessity for 
recloning [26,27]. For this reason, exonuclease III is 
the enzyme of choice for the generation of nested 
deletions. Figure 20.6 illustrates the procedures 
involved in the generation of nested deletions using 
exonuclease III. 
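
The end-type rule described above can be written down as a small check on a proposed pair of restriction enzymes. This is an illustrative sketch only: the enzyme list is a small selection of common enzymes with their well-known end types, the function names are hypothetical, and whether a given site is unique in a particular vector and insert still has to be confirmed separately.

```python
# End type left by a few common restriction enzymes.
END_TYPE = {
    "EcoRI": "5' overhang", "BamHI": "5' overhang",
    "HindIII": "5' overhang", "XbaI": "5' overhang",
    "SmaI": "blunt", "EcoRV": "blunt",
    "PstI": "3' overhang", "KpnI": "3' overhang",
    "SacI": "3' overhang", "SphI": "3' overhang",
}


def exo3_susceptible(enzyme):
    """True if exonuclease III can start digestion at this end
    (protruding 5' terminus or blunt end)."""
    return END_TYPE[enzyme] in ("5' overhang", "blunt")


def valid_pair(insert_side, vector_side):
    """A usable pair leaves the insert side open to digestion and
    protects the vector (primer-site) side with a 3' overhang."""
    return exo3_susceptible(insert_side) and not exo3_susceptible(vector_side)


print(valid_pair("BamHI", "PstI"))     # insert side open, vector side protected
print(valid_pair("EcoRI", "HindIII"))  # both ends open: vector would be digested
```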

An obvious advantage of this approach is the
ability to use a universal primer for all the deleted
fragments. Thus, sequencing can be carried out
using either fluorescent or radioactive sequencing
without the production of costly primers. The
method is therefore ideal for small molecular biology
labs working on a tight budget.

Fig. 20.6 Generation of nested deletions.
Double-stranded DNA is digested to leave the insert
susceptible to exonuclease III (ExoIII) digestion while
the vector remains ‘safe’. DNA is treated with ExoIII
for variable lengths of time or at different
temperatures to yield a range of deleted products.
Mung bean or S1 nuclease is used to blunt-end the
DNA. Fragments are size-selected and recircularized
prior to transfection. The universal primer can then
be used to sequence stepwise across the insert.

Exonuclease III is a nonprocessive enzyme with a
very stable reaction rate. This allows one to control
the rate and extent of deletions by manipulation of
reaction temperatures and times (Table 20.1). Thus,
progressive deletions of 200-250 bp can be
reproducibly achieved.
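
Because the rate is stable, the sampling times for a deletion series follow from simple arithmetic. The sketch below is an illustration under stated assumptions: it takes three of the rates listed in Table 20.1 (37, 30 and 23 °C), assumes the rate stays constant over the whole digestion, and uses a hypothetical function name. Rates should be checked for each enzyme batch.

```python
# Deletion rates in bp per minute at 20 units per microgram DNA
# (selected rows of Table 20.1).
RATES_BP_PER_MIN = {37: 400, 30: 230, 23: 125}


def sampling_times(insert_bp, step_bp, temp_c):
    """Return the times (min) at which to remove aliquots so that
    successive deletions differ by step_bp across an insert of insert_bp."""
    rate = RATES_BP_PER_MIN[temp_c]
    n_steps = insert_bp // step_bp
    return [round(k * step_bp / rate, 2) for k in range(1, n_steps + 1)]


# A 1 kb insert deleted in 250 bp steps at 23 °C.
print(sampling_times(1000, 250, 23))
```

Lower temperatures give more widely spaced time points, which are easier to handle by hand when small steps are wanted.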

Because exonuclease III deletions are so con- 
trollable, the redundancy can be tailored to meet 
the demands of the project. In addition, if suitable 
clones are chosen, both strands can be covered by 
the use of two sets of deletions. 




20.5.1 Applications of nested deletions to 
sequencing projects 


Exonuclease III deletions can be carried out using 
either M13 or phagemid clones. Thus projects up to 
about 9 kb are entirely feasible. The limitation of this 
technique is the distance over which deletions can 
be obtained. Use of the nested deletion strategy has 
been reported for a number of cDNA and small-scale
genomic sequencing projects, for example the 
sequencing of the mouse myelin P2 protein gene 
(4 kb) [28]. 

Single-stranded DNA can also be used to create 
exonuclease III deletions, when oligonucleotide 
hybridization is used to facilitate restriction digest 
linearization of the target DNA [29,30]. Dale et al. 
[29] reported the generation of deletions for single- 
stranded M13 clones. They used the method to 
sequence 2.6 kb of the maize mitochondrial 18S
rDNA and its 5’ flanking region ‘in less than a week’.

Although the method would not be ideal for
large-scale genome projects, it has obvious
advantages when dealing with ‘gaps’ or repeats. The
controlled deletion strategy means that one knows
at the assembly stage which copy of the repeat one is
dealing with. This ‘mapping’ information does
not exist with a random shotgun approach. Thus, it
is a powerful method to be used alongside a
traditional shotgun approach for such projects.


Table 20.1 Exonuclease III deletion rates, based on 20
units per microgram of double-stranded DNA.

Temperature (°C)    Deletion rate (bp/min)
37                  400
34                  375
30                  230
23                  125



Additionally, nested deletions can be used to
facilitate the sequencing of difficult regions, for 
example very GC-rich sequences [31]. 


20.6 Multiplex sequencing 


Multiplex sequencing is a variant of the shotgun 
method in which a number of samples are pooled 
during processing (thereby saving labour) and 
separated by hybridization detection at the end of 
the process. Like simple shotgun sequencing it can 
be used for any size of project, including genomic, 
YAC, P1, cosmid, plasmid, or cDNA. It can utilize a 
variety of DNA fragments such as restriction, 
shotgun, nested deletion, and PCR product (Chapter 
21). DNA can be either single or double stranded. 
Sequencing chemistry (Chapter 22) can be either 
chemical or dideoxy using a variety of polymerases. 
Gels can be capillary blotted or electroblotted, or
direct transfer electrophoresis (DTE) may be used (see
Chapter 23). Sequence detection can be either
radioactive or chemiluminescent (Fig. 20.7).

Multiplex sequencing was first described by
Church and Kieffer-Higgins [32] as a method of
sequencing and electrophoresing many DNA
samples in each set of four lanes on a gel and then
probing as many times as there are samples in each
set. The procedure utilized a set of 20 vectors, each
containing two unique oligonucleotide sequences
inserted either side of a cloning site. Sonicated
genomic DNA fragments were cloned into each of
these 20 vectors and transformed. The constructs
were then pooled, and DNA preparations carried
out by alkaline lysis in groups of 20. Chemical
sequencing by the Maxam and Gilbert method [33]
was used. After gel electrophoresis the DNA was
electro-eluted onto a nylon membrane and
cross-linked by UV irradiation. The sequences were
probed with 32P-labelled oligonucleotides
(oligos) complementary to those used in the vector
constructs, the membranes being probed
systematically with each oligo in turn. Bands were
visualized by autoradiography using standard X-ray
film. This method allowed 40 different sequences to
be read from each set of reactions.
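
The bookkeeping behind these numbers can be sketched as follows. This is only an illustration: it assumes one clone per vector in each pool and one read from each of the two tag sequences flanking the cloning site, and the function name is hypothetical.

```python
import math


def multiplex_plan(n_clones, n_vectors=20, tags_per_vector=2):
    """Summarize the work needed to sequence n_clones by multiplexing.

    Each pool holds one clone per vector and occupies one set of four
    lanes; each tag is read out by one hybridization of the membrane.
    """
    reads_per_lane_set = n_vectors * tags_per_vector
    lane_sets = math.ceil(n_clones / n_vectors)
    return {
        "reads_per_lane_set": reads_per_lane_set,
        "lane_sets": lane_sets,
        "probings_per_membrane": reads_per_lane_set,
        "total_reads": n_clones * tags_per_vector,
    }


print(multiplex_plan(200))
```

The saving is in gel running and template preparation; the number of hybridizations per membrane rises with the plex number, which is why film reading later becomes the limiting step.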

Recently, much research has been carried out to
find alternative nonradioactive methods of probing
membranes. For reasons of safety and storage,
chemiluminescent probes are now preferred by
many laboratories. Compared with radioactive
materials, these nonradioactive labels present no
disposal problem, require much shorter exposure
times (typically 10-15 min, whereas radioactive
labels require hours or even days) and are easily
incorporated into existing protocols without the
need for expensive equipment.

Fig. 20.7 Multiplex DNA sequencing. (a) Subcloning of
target DNA. Different DNA fragments are subcloned
into a set of different sequencing vectors.
(b) Preparation of DNA and sequencing reactions.
Templates are pooled, then grown, prepped and
sequenced in the same way one would treat individual
samples. (c) Determination of individual sequences.
Sequencing gels are transferred to membranes, which
are hybridized sequentially with probes specific for
each different vector, revealing the sequences for the
individual clones.

There have been several reports of changes to the
original multiplex procedure and protocols are
available for the use of:

1 multiplex ‘tagged’ vectors; 

2 single vector, multiplex ‘tagged’ primers; 

3 single vector, multiplex probe labelling. 


20.6.1 Multiplex ‘tagged’ vectors 


Multiplex vectors which can be used with the com- 
monly used Sanger dideoxy sequencing technique 
[34] are now available [35]. Ten-plex vectors based 
on the original method of incorporating unique 
oligonucleotide sequences into the vector construct 
can be used in double-stranded sequencing with 
forward and reverse primers to produce 20 different 
sequences to be loaded in each set of four lanes on 
the gel. Chemiluminescence is used to visualize 
bands. After the gel has been blotted, biotinylated 
DNA probes which are complementary to the plex 
tagging sequences are hybridized. 

After hybridization, biotinylated alkaline
phosphatase is linked to the bound DNA probe through a
streptavidin bridge. The chemiluminescent reaction
is performed by adding dioxetane, which produces
light at 477 nm and allows the sequence band
patterns to be recorded by exposure to X-ray film.


20.6.2 Single vector, multiplex ‘tagged’ primers 


A single vector system using ‘tagged’ primers has 
been described [36]. This method uses standard
M13/dideoxy sequencing reactions on single- 
stranded DNA and a set of eight ‘tagged’ primers. 
These are 37mers synthesized with the M13 forward 
primer sequence at the 3’ end and a series of 
different 20mers at the 5’ end. Sequencing reactions 
are set up with each of the eight ‘tagged’ primers 
then pooled and electrophoresed. The gel is blotted 
then probed with 20mer oligonucleotide sequences 
(complementary to each of the ‘tagged’ primers) 
which have been labelled with digoxigenin and 
detected using antidigoxigenin antibody—alkaline 
phosphatase conjugate and chemiluminescent dio- 
xetane substrate. Sequence is recorded by a 10-15 
min exposure to X-ray film. 
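
The layout of the ‘tagged’ primers and their probes can be illustrated in a few lines. This is a sketch under stated assumptions, not the published design: the 17-base M13 (-20) forward primer sequence is assumed to be the one used, and the two 20-mer tags are hypothetical sequences chosen only to show the arithmetic (20 + 17 = 37 bases).

```python
# Assumed primer: the common 17-base M13 (-20) forward primer.
M13_FORWARD = "GTAAAACGACGGCCAGT"

# Hypothetical 20-mer tags, for illustration only.
TAGS = {
    "tag1": "ACGTTGCAAGCTTGCATGCA",
    "tag2": "GATCCTAGGTACCATGGCTA",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5' to 3')."""
    return seq.translate(COMPLEMENT)[::-1]


def tagged_primer(tag):
    """Join a 20-mer tag (5' end) to the M13 primer (3' end)."""
    if len(tag) != 20:
        raise ValueError("tag must be a 20-mer")
    return tag + M13_FORWARD


def probe_for(tag):
    """Probe that hybridizes to extension products carrying this tag."""
    return reverse_complement(tag)


for name, tag in TAGS.items():
    print(name, tagged_primer(tag), probe_for(tag))
```

Because the tag sits at the 5’ end of every extension product made from that primer, a probe complementary to the tag lights up only the ladder from one template in the pooled lanes.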


20.6.3 Single vector, multiplex probe labelling 


A method using a single vector and primer with one
of four different hapten labels attached is described
[37]. This method allows the use of familiar
vector/primer systems. Because the label is carried by
the primer, the system is comparable with the
fluorescent primers used in the ABI 373 fluorescent
sequencing machines. The hapten labels currently used
are biotin, digoxigenin, 2,4-dinitrophenyl and
fluorescein. Four separate sequencing reactions are set
up with these primers, then all T reactions are pooled,
all C reactions, etc. Sequence products are
simultaneously electrophoresed and blotted by DTE as
described by Beck and Pohl [38]. DTE involves the
movement of a membrane across the bottom of the
gel as it is running, onto which the DNA is
transferred as it elutes from the gel. Bands are
detected by sequential probing with either
hapten-specific alkaline phosphatase or
streptavidin-alkaline phosphatase conjugate and a
dioxetane solution.

The choice of vector/primer to be used in
multiplex sequencing is dependent on the size of the
project, available equipment and knowledge of
sequencing methodologies. Small projects could
quite easily be set up using methods 2 or 3. All
protocols can be carried out using commonly
available laboratory equipment, although for high
throughput, automation is desirable, which would
increase the cost. The use of commonly available
sequencing vectors requires no multiple cloning
steps, so these methods can be set up in any molecular
biology laboratory. Using ‘tagged’ primers currently
allows 8-plexing, though one could increase this
number by synthesizing more primers. The
hapten-labelled primer system allows 4-plexing, but
work is continuing to find different labels so this
number could increase. For large-scale sequencing
projects the ‘tagged’ vector method would seem to be
the most efficient and this has been demonstrated in
many cases, for example with Mycoplasma [39],
Mycobacterium [40] and E. coli [32]. The main
disadvantage of these systems would seem to be
interpreting and recording the sequence from the
autoradiographs. For a small project a sonic digitizer
[41] is quite adequate, but for genome-sized projects
film reading is a definite bottleneck. An adequate
and reliable automated system has yet to be
demonstrated, although such a system is described
by Richterich and Church [42].

A summary of the strategies described in this
chapter is given in Table 20.2.

Table 20.2 Summary of sequencing strategies.

Method               Advantages                               Disadvantages
Random
Shotgun              Rapid data acquisition                   High redundancy
                     Utilizes universal primer                Rarely completely random
                     Suitable for most sizes of project
Ordered
Nested deletion      Low redundancy                           Unsuitable for large-scale projects
                     Useful for repetitive sequence and gaps
Primer walking       Low redundancy                           Cost of custom primers
                     No subcloning required                   Unsuitable for large-scale projects
Transposon-mediated  Low redundancy                           Large-scale mapping may be required
                     Utilizes universal primer
                     Suitable for any size of project
Random or ordered
Multiplex            Rapid data acquisition from minimal      Gel-reading bottleneck
                     number of gels
                     Suitable for any size of project

21.1 Introduction


DNA sequencing is a complex process requiring 
several complicated steps such as DNA cloning, 
mapping of cloned fragments, subcloning, template 
generation, sequencing reactions, gel electrophore- 
sis, transferring sequence data into the computer, 
data analysis and data handling. The availability of 
high-quality DNA templates in large numbers is 
important for the success of any DNA sequencing 
method. It is therefore not surprising that a wide 
variety of methods and techniques has been devel- 
oped over the past 15 years to generate templates by 
bacterial amplification or by the polymerase chain 
reaction (PCR). Many groups, including those 
engaged in large-scale sequencing projects, still 
favour M13, phagemid or plasmid templates which 
are obtained in vivo by bacterial growth. In vitro 
amplification by PCR has significantly changed the 
way in which DNA can be produced and PCR and 
DNA sequencing are more and more linked together 
in modern strategies. 

This chapter presents a selection of standard,
improved and new protocols for the in vivo 
(Fig.21.1) and in vitro (Fig.21.2) amplification and 
purification of templates for DNA sequencing. Most 
of the protocols presented are used successfully in 
our laboratories. Table 21.1 gives an overview of the 
number of templates that can be handled in each 
protocol. 


21.2 In vivo amplification methods 


21.2.1 Plasmid templates 


21.2.1.1 Standard alkaline lysis miniprep [1-4] 

Plasmids are purified from liquid cultures that 
contain the appropriate antibiotic and have been 
inoculated with a single bacterial colony picked 
from an agar plate. Many of the currently used 
plasmid vectors (e.g. the pUC series) replicate to 
such high copy numbers that they can be purified in 
large yield from 2- to 3-ml cultures that have simply 
been grown to late log phase in standard LB 
medium. The procedure described in Protocol 100
works well for plasmids smaller than 15 kb. Plasmid 
DNA prepared according to this standard protocol is 
a good template for radioactive sequencing. 

For convenience, 1.5 ml or 2.0ml snap-cap micro- 
fuge tubes are used. Depending on the availability of 
microcentrifuges the procedure is performed in 
batches of 12, 18 or 24 samples at a time. The alkaline 
lysis protocol can also be adapted for the growth 
and preparation of plasmid DNA from small culture 
volumes (250-500 µl) utilizing standard 96-well
plates. Several hundred plasmids can be prepared 
simultaneously, yielding sufficient DNA for at least 
one cycle sequencing reaction. 


21.2.1.2 Standard alkaline lysis miniprep followed by 
polyethylene glycol precipitation [5] 

Plasmid DNA prepared according to the standard 
alkaline lysis protocol still contains residual salt, 
detergent and some RNase-resistant tRNA species 
which might interfere with fluorescent sequencing 
reactions that use dye primer or dye terminator 
chemistry. In order to obtain high-quality template 
for fluorescent sequencing, the plasmid DNA 
should be further purified. We recommend an extra 
polyethylene glycol (PEG) precipitation (Protocol 
101) that effectively removes these impurities. 




21.2.1.3 Short alkaline miniprep for plasmid DNA [4,6]

The major difference between this method, which is
given in Protocol 102, and the standard method
(Protocol 100) is that no RNase digestion and
phenol/chloroform steps are used. Alkaline lysis is
followed by ethanol precipitation using only one
volume of ethanol. Prior to sequencing, sequencing
primer is added and the plasmid DNA is denatured
by treatment with NaOH followed by neutralization
with HCl. The alkaline treatment also degrades the
RNA. The miniprep DNA is an excellent template
for radioactive sequencing using α-35S-dATP and
Sequenase 2.0. It is not suitable for fluorescent
sequencing using dye primer or dye terminator
chemistry.


21.2.1.4 QIAGEN preparation 

Qiagen is among several suppliers providing fast 
and convenient DNA purification systems for DNA 
sequencing. Two basic purification techniques can 
be used separately (1 and 2) or in combination (3). 


Table 21.1 Methods for generation of sequencing templates and their scale of application.

                                                                     Sample number
Topic      Method                                        Product   1-12   24-48   96
21.2       In vivo
21.2.1     Plasmid
21.2.1.1   Standard alkaline lysis miniprep              ds        x      -       -
21.2.1.2   Standard alkaline lysis miniprep + PEG        ds        x      -       -
21.2.1.3   Short alkaline miniprep                       ds        x      x       -
21.2.1.4   Qiagen method
           Anion-exchange-based purification             ds        x      x       -
           (Qiagen mini/midi/maxi/mega prep,
           QIAwell 8 Kit)
           Silica gel-based purification                 ds        x      x       -
           (QIAprep Spin Kit, QIAprep 8 Kit)
           Anion-exchange + silica gel purification      ds        -      x       x
           (QIAwell Plus Kit, QIAwell Ultra Kit)
21.2.2     M13
21.2.2.1   Standard PEG-phenol method                    ss        x      x       -
21.2.2.2   Detergent extraction method                   ss        -      x       x
21.2.2.3   Magnetic bead purification                    ss        -      -       x
           (affinity capture)
21.2.2.4   Silica gel-based membranes (QIAprep)          ss        x      x       -
21.2.3     Phagemid
           Standard PEG-phenol method                    ss        x      x       -
21.2.4     Cosmid
21.2.4.1   Anion-exchange-based purification             ds        x      -       -
21.2.4.2   CsCl gradient                                 ds        x      -       -
21.3       In vitro
21.3.1     Asymmetric PCR (pure product)                 ss        x      x       x
21.3.2     Symmetric PCR (pure product)                  ds        x      x       x
21.3.2.1   Selective PEG precipitation                   ds        x      x       x
21.3.2.2   Magnetic bead purification by                 ss        x      x       x
           affinity capture
21.3.3     Symmetric PCR (product mixture)
21.3.3.1   Silica gel-based purification                 ds        x      -       -
21.3.3.2   ‘Freeze and squeeze’ method                   ds        x      -       -
21.3.3.3   Column purification                           ds        x      -       -
21.3.3.4   Agarase method                                ds        x      -       -
21.3.3.5   Phenol/chloroform extraction                  ds        x      -       -
21.3.3.6   Direct sequencing of DNA in low               ds        x      -       -
           melting point agarose


(1) Separation based on anion-exchange chromatography
(QIAwell systems). DNA is a negatively charged
biopolymer which can easily bind to a solid support
possessing positive charges. After washing, the
DNA is removed from the support with simple salt
buffers followed by ethanol precipitation to remove
the salt and to concentrate the DNA. This process is
called anion-exchange chromatography. It can be
used to separate and purify DNA from the complex
mixtures obtained after alkaline lysis, and it avoids
phenol/chloroform extractions. The effectiveness of
the separation process depends largely on the
properties of the solid support.

The QIAGEN support is a special silica gel-based
resin or membrane which has an optimal particle
size of around 100 µm and a special surface coating
containing diethyl aminoethyl (DEAE) groups
which creates a high surface charge density. It
selectively separates DNA from substances such as
proteins and carbohydrates.

Different kits are available which allow
purification of plasmid DNA in microgram to milligram
amounts and contain anion-exchange resins in
different formats: for example single columns
(QIAGEN-tip 20-10000) or strips of eight columns
(QIAwell 8 Plasmid Kit).

(2) Silica gel-based purification (QIAprep systems).
Single- or double-stranded DNA selectively
adsorbs to silica gel surfaces in the presence
of high concentrations of chaotropic salts. Under
optimized conditions, carbohydrates, RNA and
proteins do not adsorb and are removed by washing.
DNA is then eluted from the support under low salt
conditions. QIAprep Spin Plasmid Kits utilize
microfuge spin columns and are ideal for 1-12
minipreps. QIAprep 8 Plasmid Kits utilize 8-well
strips for higher throughput of up to 48 minipreps.


(3) Combined anion-exchange chromatography/silica
gel-based purification. The quality of the plasmid DNA
can be further improved by combining
anion-exchange separation with selective binding to
silica gel membranes. This avoids the final ethanol
precipitation step. Kits based on a combination of these
two principles are also available in two different
formats: strips of eight columns (QIAwell 8 Plus Plasmid
Kit) and a 96-column array (QIAwell 96 Ultra Plasmid
Kit). With the help of a vacuum manifold, 48-96
samples can be easily processed within 2 h.

The decision about which kit is to be used for a
particular project depends on the number and the
amount of templates that will be needed and on the
type of sequencing chemistry. For cycle sequencing
with Taq polymerase much less plasmid template is
needed than for a single sequencing reaction
using Sequenase 2.0 or T7 polymerase (e.g. 1 µg of
plasmid DNA is needed for cycle sequencing with
ABI fluorescent dye terminators, whereas 3-5 µg are
needed for the Pharmacia ALF sequencer if dye
primer chemistry is used). Also, if a directed
sequencing strategy using custom-made primers is
adopted, several sequencing reactions must be
performed with the same template, which results in a
need for much more template DNA.

QIAprep plasmid kits are more suitable for
small-scale projects based on manual sequencing with
radioactive or chemiluminescent labelling. Good
results can be obtained with QIAwell plasmid kits
and fluorescent T7 DNA polymerase sequencing.
For large-scale sequencing projects QIAwell 8 Plus
and QIAwell 96 Ultra Kits are favoured because
several hundred plasmid templates of high quality
can conveniently be prepared in a short period of
time. Typical yields are 10-20 µg DNA per sample.
The plasmid DNA is suitable for dye primer and dye
terminator cycle sequencing using the ABI 373A
system. It can also be used in the Pharmacia ALF
system.
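
The template amounts quoted in this section translate directly into the number of reactions one preparation can support. The following is a minimal sketch using those figures (a yield of 10-20 µg per sample, 1 µg per dye terminator cycle sequencing reaction, 3-5 µg per ALF dye primer reaction); the function name is hypothetical and losses during handling are ignored.

```python
import math


def reactions_per_prep(yield_ug, ug_per_reaction):
    """Whole number of sequencing reactions one template prep supports."""
    return math.floor(yield_ug / ug_per_reaction)


# Worst case: lowest yield combined with the highest requirement.
print("cycle sequencing:", reactions_per_prep(10, 1))
print("ALF dye primer:", reactions_per_prep(10, 5))
```

A directed strategy that needs many primer-walking reactions on one template is therefore much more demanding with T7-type chemistry than with cycle sequencing.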

For protocols for these methods refer to the
instruction manuals supplied with the kits. In order
to achieve high yields, grow bacteria under optimal
conditions. The use of 5-ml overnight cultures in
50-ml Falcon or glass tubes is recommended. Qiagen
provides a robotic workstation designed to use the
QIAwell 96 Ultra Plasmid Kit.


21.2.2 M13 templates 


21.2.2.1 Standard PEG—phenol method [7-10] 

The growth and purification of M13 DNA from 
small-volume cultures (2-3 ml) is a straightforward, 
rapid and easy procedure to perform. M13 phages 
do not lyse their hosts, but are released from infected 
cells as the cells continue to grow and divide. Phage 
particles are readily separated from the bacteria by 
centrifugation. After precipitation of the phage 
particles with PEG, the single-stranded DNA is 
recovered by removal of the phage protein coat with 
phenol and purified by ethanol precipitation. Small
liquid cultures (1.5 ml) can be processed in
disposable polypropylene tubes using microcentrifuges
and yield pure single-stranded DNA (≈5-10 µg)
sufficient for several sequencing experiments. It is
possible to grow and purify batches of 12, 18 or 24 
minicultures sequentially, so that up to 96 samples 
can be processed in a day. However, the handling of 
those numbers is tedious and much time is spent 
opening and closing tubes and transferring tubes in 
and out of centrifuges. There are several protocols 
for M13 minipreps in 96-well microtitre plates. The 
reader is referred to these procedures if the 
processing of several hundred templates is being 
contemplated. Protocol 103 describes a standard 
PEG-phenol method for recovery of DNA from M13 
phage. 


21.2.2.2 Detergent extraction method [11,12] 

Protocol 104 combines PEG precipitation of M13 
phage particles with a nonionic or ionic detergent 
extraction coupled with heating to denature the M13 
protein coat. The whole procedure of growing cul- 
tures, PEG precipitation, centrifugation and deter- 
gent extraction is performed in special 96-tube boxes 
allowing the easy handling of hundreds of M13 
clones per day. The M13 DNA obtained is an 
excellent template for fluorescent sequencing using 
dye primer chemistry. 

This detergent extraction method avoids phenol 
extraction and ethanol precipitation. It is therefore 
faster and less labour intensive than the standard 
PEG-phenol method (see Protocol 103). 


21.2.2.3 Magnetic bead purification [13-15] 

M13 single-stranded DNA can conveniently be
purified for large-scale sequencing applications
using biotinylated M13-specific oligonucleotides
coupled to streptavidin-coated paramagnetic beads
(Protocol 105). First, phage supernatants are lysed.
Second, PEG and paramagnetic beads with the oligo
probe are added and the oligo is allowed to anneal to
the template DNA. Then the beads are separated
using a magnet and the supernatant is removed.
Finally, the beads are repeatedly washed before the
template is released from the beads by heating.

This procedure is carried out in 96-well plates. The
yield per sample is ≈0.5 µg DNA, which is sufficient
for one cycle sequencing reaction. Since
centrifugation is not necessary (except for the initial
clearing of the bacterial lysates) most of the time is
spent waiting for lysis or annealing. Therefore it is
possible to process several plates in parallel, up to
192 samples in 2 h.

We are currently sequencing cosmids containing 
genomic DNA by the shotgun approach. We routine- 
ly generate 1000-1500 M13 templates per cosmid 
which then are subjected to cycle sequencing with 
dye primer chemistry and analysed on 373A ABI 
sequencers. The average read length is 400 bases per 
sample. 
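
These figures imply a particular sequence redundancy, which can be estimated with one line of arithmetic. The sketch below is an illustration under explicit assumptions: a cosmid insert of 40 kb (not stated in the text), reads that all reach the average length, and, for the gap estimate, reads that start uniformly at random (an idealized Poisson model that real shotgun libraries only approximate). The function names are hypothetical.

```python
import math


def redundancy(n_reads, read_len, target_len):
    """Average number of times each base of the target is read."""
    return n_reads * read_len / target_len


def expected_gap_fraction(coverage):
    """Fraction of the target expected to remain unsequenced,
    assuming uniformly random read starts (idealized model)."""
    return math.exp(-coverage)


COSMID_INSERT_BP = 40_000  # assumed insert size

for n_templates in (1000, 1500):
    c = redundancy(n_templates, 400, COSMID_INSERT_BP)
    print(n_templates, "templates:", c, "fold,",
          "gap fraction", expected_gap_fraction(c))
```

In practice cloning bias leaves more gaps than this model suggests, which is one reason why directed methods are used for finishing.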

The magnetic bead prep procedure can also be 
adapted to robotic workstations like the Catalyst 
(Applied Biosystems) and the Biomek 1000 or 2000 
(Beckman). However, substantial input is required to 
alter hardware components and to develop reliable 
protocols. Also, throughput with these machines 
is rather limited. New hardware with higher 
throughput is presently being developed. 

Paramagnetic beads with M13-specific oligo pro- 
bes are commercially available from Promega and 
Dynal. 


21.2.2.4 Silica gel membrane (glass filter) purification 
[16-18] 

M13 phages which have previously been
precipitated with acetic acid are applied to silica gel or
glass filter membranes. Under these conditions intact
phage particles are retained on the membrane. Upon
addition of a high concentration of chaotropic salt
(NaClO4), single-stranded M13 DNA binds
(adsorbs) to the membrane, while the phage coat
proteins pass through and are efficiently removed.
After washing, pure M13 DNA is eluted from the
membrane using water or TE.

The method described in Protocol 106 yields a 
high-quality M13 template DNA which is ideal for 
any standard manual or automated sequencing 
application. It is suitable for cycle sequencing using 
dye primer and dye terminator chemistry and the 
ABI 373A as well as for T7 or Sequenase 2.0 
extension reactions using dye primer and the 
Pharmacia ALF system. 


21.2.3 Phagemids [19-21] 


Several vectors have been developed that combine 
desirable features of both plasmids and filamentous 
bacteriophages. These are plasmids containing an 
origin of replication from a filamentous bacterio- 
phage. They have several attractive features. 
1 They provide the same stability and high yields of 
double-stranded DNA as conventional plasmids. 
2 They eliminate the tedious and time-consuming 
process of subcloning DNA fragments from plas- 
mids to bacteriophage vectors. 
3 They are small enough to accommodate segments 
of foreign DNA up to 10 kb in length that can then be 
obtained in single-stranded form. 
4 Whereas the phagemids can be treated like con- 
ventional plasmids, the isolation and purification of 
phagemid single-stranded DNA, apart from an 
initial superinfection with helper phage, is as 
straightforward as for filamentous phages. 

Protocol 107 describes the preparation of phage- 
mid DNA. 


21.2.4 Cosmids 


Cosmids can serve either as a template for sequenc- 
ing (directed sequencing using custom primers) or 
as the substrate for generation of shotgun libraries in 
M13 or plasmids. 


21.2.4.1 QIAGEN plasmid kits 

QIAGEN midi/maxi/mega/giga plasmid kits provide a
fast and convenient method to generate large
amounts (100 µg-10 mg) of cosmid DNA.
Cosmids purified by this procedure are suitable for
all kinds of sequencing reactions, but in most cases
the read length is considerably shorter than that
obtained with M13 or plasmid DNA as template.

Cosmid DNA prepared by QIAGEN plasmid kits
is often contaminated by considerable amounts of
Escherichia coli DNA. Therefore the use of this DNA
is not recommended for the generation of
subfragment shotgun libraries.


21.2.4.2 CsCl gradient purification [22]

Pure cosmid DNA for library construction can be
generated by one round of isopycnic centrifugation
over a caesium chloride density gradient (Protocol
108). Using a table top ultracentrifuge (Beckman
TL100) this method becomes reasonably fast and
easy. Between 5 and 50 µg of supercoiled cosmid
DNA can be obtained from a 3-ml CsCl gradient. The
E. coli contamination of libraries derived from this
DNA is less than 5%.

The open circle DNA can also be collected and
used for direct cosmid sequencing or generation of
sequencing templates by PCR.


21.3 In vitro amplification methods 


PCR has been used as a means of circumventing 
bacterial growth in the preparation of DNA 
templates. The main advantage is that the template 
preparation becomes a simple biochemical process 
that can readily be automated if required. There are 
two major strategies for sequencing of PCR pro- 
ducts: 

1 direct sequencing of PCR products without 
cloning; 

2 molecular cloning of PCR-amplified material 
prior to sequence determination. 

The direct sequence analysis of PCR products has 
the advantage that it views an entire mixture of 
amplified DNA molecules in a single assay and 
enables rapid and precise determination of sequence 
identity and variation. For example, direct sequen- 
cing of a mixture of two PCR fragments that are 
heterozygous at one or more base positions will 
reveal ambiguous signals at these positions. Thus, 
direct sequencing of PCR fragments is an attractive 
diagnostic tool for carrier analysis in basic research 
and routine diagnostics, but can also be used for 
examining mutations in disease genes. On the other 
hand, cloning of PCR products allows DNA 
sequence variants to be separated before sequen- 
cing, but a larger number of clones must be analysed 
to identify polymorphic base positions. Also, errors 
introduced by the polymerase during PCR ampli- 
fication are a serious problem if cloning is used prior 
to sequencing. Thus, direct sequencing of PCR pro- 
ducts without an additional cloning step is generally 
preferable to sequencing cloned material. In addit- 
ion to the benefit of simplicity, this greatly reduces 
the potential for errors due to imperfect PCR fidelity, 
as any random misincorporations in an individual 
template will not be detectable against the much 
greater signals of the ‘consensus’ sequence. 
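
The way a heterozygous position appears in a directly sequenced PCR product can be illustrated with a small sketch. The function below is hypothetical and does not belong to any sequencing software: it takes four per-base signal intensities at one position (an assumed input format) and returns an IUPAC ambiguity code when the second-strongest signal reaches an assumed fraction of the strongest. The threshold of 0.3 is an arbitrary illustration, not a recommended value.

```python
# IUPAC codes for two-base mixtures (standard nomenclature).
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_position(signals, het_fraction=0.3):
    """Call one sequence position from per-base signal intensities.

    signals: dict mapping 'A', 'C', 'G', 'T' to non-negative intensities
             (hypothetical input format).
    het_fraction: assumed threshold; a second signal of at least this
                  fraction of the strongest is treated as a second allele.
    """
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (top_base, top_signal), (second_base, second_signal) = ranked[0], ranked[1]
    if top_signal <= 0:
        return "N"  # no usable signal at this position
    if second_signal >= het_fraction * top_signal:
        return IUPAC_TWO_BASE[frozenset((top_base, second_base))]
    return top_base
```

In this sketch a weak secondary signal, such as one caused by a rare misincorporation in a few template molecules, stays below the threshold and the consensus base is reported, whereas two signals of similar strength give an ambiguity code.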

Compared with plasmid and M13 templates, PCR
products are much more difficult to sequence. One
problem is that the two strands of a linear double-
stranded PCR molecule quickly reanneal, which
prevents effective annealing and extension of the
sequencing primer. Early protocols suggested the
use of organic solvents such as DMSO, or of detergents,
to limit template reannealing. Rapid annealing
protocols that involved dropping the temperature
quickly were also tried to solve this problem.

Several modified PCR protocols were developed
to solve this problem in a more general way
(Fig. 21.2). In asymmetric PCR (Section 21.3.1) the
two primers are added to the reaction mix at
different concentrations. Thus, after one primer is
exhausted the exponential PCR amplification stops
and one of the two strands undergoes further linear
amplification. The final product mixture will contain
sufficient single-stranded template to be used for
sequencing.

In symmetric PCR (Section 21.3.2) the two primers 
are added to the reaction mix at the same concen- 
tration and a double-stranded DNA molecule is 
formed during amplification. Double-stranded PCR 
products can easily be sequenced using a linear 
amplification/sequencing protocol known as cycle 
sequencing. In this method the sequencing primer is 
repeatedly annealed to one strand of the PCR 
molecule and extended by Taq polymerase after heat 
denaturing. Prior to sequencing, excess primers and 
nucleotides as well as small molecular weight 
material must be removed from the template. In 
Section 21.3.2.1 we describe a very effective and easy 
method for purifying PCR products by selective 
precipitation with PEG (Protocol 112). 

In a different strategy, one of the two strands of a
specifically modified PCR product is removed prior
to sequence analysis. One method uses the biotin-
streptavidin system to affinity purify one of the
two PCR strands prior to sequencing (Section
21.3.2.2, Protocol 113). This is achieved by introducing
a biotin label into one PCR strand and
binding the product to a suitable solid support coated
with streptavidin. The free unlabelled strand is
then removed using alkaline conditions and the
captured strand is sequenced. Another variant is
to employ a 5’-phosphorylated primer during
PCR to introduce a phosphate group at the 5’ end
of one strand. λ exonuclease is then used to ‘chew
off’ this strand, leaving the other strand for
sequencing.

A major problem with direct sequencing of PCR 
products, though, is that because of nonoptimal 
conditions several nonspecific products, in addition 
to the main PCR product, are often formed during 
amplification. Also, excess nucleotides and oligonu- 
cleotide primers as well as small molecular weight 
material are still present, leading to a complex 
mixture. It is important to purify the main PCR 
product away from nonspecific fragments, nucleo- 
tides and primers. The most effective method in our 
hands is the use of agarose gel electrophoresis for 
separation of the product mixture followed by 
isolation of the DNA from agarose. Agarose gel 
electrophoresis has the advantage of allowing many 
samples to be handled in parallel. It is cheap and 
widely available. In Sections 21.3.3.1-21.3.3.6 we
present several protocols for the isolation of DNA
from agarose gels which work well in our laboratories
(Protocols 114-117).

Single-stranded PCR templates can be sequenced 
using conventional primer extension reactions with 
Sequenase 2.0 or other polymerases. Double- 
stranded PCR products are effectively sequenced 
using cycle-sequencing reactions with either 
labelled primers (radioactive or fluorescent) or
with dye terminator chemistry. 


21.3.1 Asymmetric PCR (single-stranded 
template) [23-27] 


Asymmetric PCR utilizes unequal, or asymmetric,
concentrations of the two amplification primers
(Protocol 109). During the initial 15-25 cycles 
double-stranded DNA is generated, but once the 
limiting primer is exhausted, single-stranded DNA 
complementary to the limiting primer is accumul- 
ated by linear amplification for the next 5-10 cycles. 
Typical primer ratios for asymmetric PCR are 50:1 
or 100:1. The single-stranded template can be 
sequenced with either the limiting primer or a 
nested primer. 
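
The switch from exponential to linear amplification can be sketched with a deliberately idealized model. The function below is a hypothetical illustration, not a published kinetic model: it assumes every template is copied in every cycle, ignores strand reannealing and enzyme decay, and simply counts primer molecules.

```python
def asymmetric_pcr(template_copies, limiting_primer, excess_primer, cycles):
    """Idealized asymmetric PCR, counted in molecules.

    Assumptions (illustrative only): 100% efficiency per cycle, no
    reannealing competition, and one primer molecule used per new strand.
    Returns (double_stranded, single_stranded) product counts.
    """
    ds = template_copies  # double-stranded molecules
    ss = 0                # excess single strands primed by the excess primer
    lim, exc = limiting_primer, excess_primer
    for _ in range(cycles):
        # Each duplex can template one new strand from each primer.
        new_lim = min(ds, lim)
        new_exc = min(ds, exc)
        lim -= new_lim
        exc -= new_exc
        # Strands made from both primers add up to new duplexes
        # (exponential phase); the surplus made from the excess primer
        # alone accumulates as single strands (linear phase).
        paired = min(new_lim, new_exc)
        ds += paired
        ss += new_exc - paired
    return ds, ss
```

With the primer amounts of Protocol 109 (0.5 µM and 0.01 µM, taken here as 50 pmol and 1 pmol in a 100-µl reaction) and an assumed 10^6 starting templates, the limiting primer in this model runs out after roughly 20 cycles, which is consistent with the 15-25 cycles of exponential amplification quoted above.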

One problem with asymmetric PCR is that the 
primer ratio and the thermal cycling conditions 
must be optimized to ensure reliable generation of 
single-stranded template suitable for sequencing. 
This is a tedious process and asymmetric PCR has 
not proved reliable for routine sequencing. In a 
slight modification of the original protocol, a regular 
symmetric PCR with two primers is performed 
yielding double-stranded DNA. Then, a linear 
amplification with one of the two original primers is 
performed using a template containing 1% of the 
first reaction (Protocol 110). 


21.3.2 Symmetric PCR (pure double-stranded 
template) 


During symmetric PCR both primers are added at 
the same concentration to the reaction mix and a 
double-stranded PCR product is formed. Total 
genomic DNA, YACs, cosmids, plasmids or eukary- 
otic or bacterial cells can be used as a source for 
template DNA. In order to obtain a pure product in 
sufficient quantity PCR conditions must be carefully 
optimized. 

Pure double-stranded templates can often be 
generated from bacterial colonies or cultures (also 
from phage plaques and stocks) by symmetric PCR 
using universal primer pairs (e.g. M13-21 forward / 
M13 reverse, T3/T7, KS/SK) flanking the insertion 
region. This way plasmid libraries can easily be 
sequenced in microtitre plates. 


The major problem in generating sequencing
templates from pure double-stranded PCR products
is the removal of excess primers and nucleotides, as well
as the larger amounts of truncated amplification products,
prior to sequencing. In Section 21.3.2.1, we
present a method which uses selective PEG precipitation
to achieve this (Protocol 112). The pure
double-stranded template is then subjected to cycle
sequencing.

For diagnostic sequencing (detection of muta- 
tions, heterozygous individuals) it is often necessary 
to use Sequenase 2.0 or T7 polymerase as the 
sequencing enzyme because they produce a much 
more even peak distribution than thermostable 
DNA polymerases like AmpliTaq. In these cases the 
double-stranded PCR product must be efficiently 
denatured prior to sequencing. Traditional methods 
like alkaline treatment or heat denaturing are not 
efficient because the two PCR strands reanneal very 
quickly. Several other methods have been published 
to generate a single-strand template from a double- 
stranded PCR molecule. One method uses the
exonuclease from phage λ to chew off one PCR strand
which is phosphorylated at its 5’ end. The remaining
PCR strand is then sequenced. In Section 21.3.2.2 we 
present another method of strand separation which 
is based on biotin capture of one of the PCR strands 
(Protocol 113). 


21.3.2.1 Purification of double-stranded PCR template 
by selective PEG precipitation [28,29] 

Excess PCR primers, nucleotides and truncated PCR 
products can efficiently be removed by one step 
precipitation of template with PEG. A special PEG 
mixture is used to selectively precipitate DNA of 
more than 150 bp leaving residual primers, nucleo- 
tides and small molecular weight PCR products in 
the supernatant. PEG-precipitated templates can be 
easily sequenced from both ends using radioactive 
or fluorescent cycle sequencing methods. Fluores- 
cent dye primer and dye terminator chemistries 
have been successfully used. 


21.3.2.2 Generation of single-stranded template by 
affinity capture using the biotin-streptavidin system 
[30-33] 

In a very elegant approach, biotin is attached to the 5’ end
of one strand of the double-stranded PCR product.
The biotin label is introduced during PCR by using 
one biotinylated primer and one unlabelled primer. 
The biotinylated PCR strand is then captured onto a 
suitable polymeric support. Paramagnetic beads 
coated with streptavidin are most suitable for this 
purpose although streptavidin-coated agarose and 
plastic surfaces (microtitre wells, pins) have also 
been used. The captured DNA is then denatured
using sodium hydroxide and the second noncap- 
tured strand is removed together with the excess 
primer and nucleotides by washing. Subsequently, 
the purified single-stranded template bound to the 
beads is suitable for sequencing reactions with 
Sequenase 2.0 or T7 polymerase using radioactive or 
fluorescent labels. The paramagnetic beads do not 
interfere with the sequencing reaction. 


21.3.3 Symmetrical PCR (product mixture) 


Because of nonoptimized amplification conditions 
and unexplained primer and template variability, 
the PCR product is often accompanied by one or 
several minor products. Together with excess 
primers and nucleotides these have to be removed 
prior to direct sequencing. 

The method of choice to get rid of both is to run 
the PCR product on an agarose gel, excise the band 
of interest and recover the DNA. Various procedures 
have been described for the recovery of DNA from 
agarose gels. In our hands the following procedures 
work well. 


21.3.3.1 DNA recovery by binding to a
silica gel-based matrix
Purification of DNA from agarose slices is based on 
solubilization of agarose with sodium iodide or 
perchlorate and selective adsorption of nucleic acids 
onto silica gel surfaces in the presence of high 
concentrations of chaotropic salts. Impurities are 
then washed away and the DNA is eluted with low- 
salt (e.g. Tris) buffer. Many well-documented, easy- 
to-use, ready-to-go kits based on this principle are 
commercially available. For protocols see product 
guides of the respective companies: 
1 GeneClean II Kit (BIO 101 Inc., La Jolla, CA, USA); 
2 Qiaex II Kit (Qiagen Inc., Chatsworth, CA, USA); 
3 InVisorb Kit (Gesellschaft für Biotechnik mbH,
Berlin, Germany); 
4 SpinBind Kit (FMC Bioproducts, Rockland, ME, 
USA); 
5 Prep-A-Gene Kit (Bio-Rad Laboratories, Hercules, 
CA, USA). 

Variations of these kits use the more convenient,
albeit usually more expensive, microfuge
cartridge systems.


21.3.3.2 DNA recovery by the
‘freeze and squeeze’ method

In the freeze and squeeze method (Protocol 114), the
agarose gel matrix is physically destroyed and the
DNA is released together with the water of gelation.
As the name implies, the gel slice with the DNA
band of interest is frozen and, while still frozen,
squeezed through an appropriate filter (e.g.
Millipore’s ULTRAFREE-MC 0.45 µm filter unit) by
centrifuging at high speed. The filter holds back the
dry, powdered agarose. The DNA can then be
recovered from the aqueous filtrate by simple
ethanol precipitation.


21.3.3.3 DNA recovery by column purification
Purification columns allow the rapid purification of 
PCR fragments from low melting point agarose gels 
(Protocol 115). DNA is recovered from the slice of 
agarose by running the molten gel on a purification 
column containing a DNA-binding resin (e.g. 
Magic/Wizard PCR Preps, Promega). After washing
the column with isopropanol, DNA is eluted with 
water. The whole procedure takes about 30 min from 
start to finish and a clean DNA fragment suitable for 
radioactive cycle (‘fmole’, Promega) or fluorescent 
cycle sequencing with dye terminators (Applied 
Biosystems) is obtained. 


21.3.3.4 Agarase method [34] 

DNA can also be recovered from low melting point 
agarose slices by using an agarose-digesting enzyme 
(GELase, Cambio Ltd, Cambridge, UK or Agarase, 
Calbiochem, San Diego, USA) followed by ethanol 
precipitation of the DNA (Protocol 116). 


21.3.3.5 Phenol/chloroform extraction [35] 

DNA can easily be recovered from LMP agarose by 
repeated extraction with phenol and chloroform 
followed by ethanol precipitation. This traditional 
method was widely used in the past before DNA 
binding to a silica gel-based matrix became fash- 
ionable. The method is cheap, works very reliably 
and should also be considered for recovering 
PCR products from agarose slices for direct 
sequencing. 


21.3.3.6 Direct sequencing of DNA in LMP agarose [36]
Protocols 114-116 and other methods described in
Sections 21.3.3.1-21.3.3.5 use an additional step to recover
the DNA from agarose gels. It has recently been 
shown that DNA purified in LMP agarose can be 
directly sequenced in molten agarose (Protocol 117). 
Therefore, the PCR product can be prepared for 
sequencing in a single step and sequenced under the 
same conditions as a normal double-stranded 
template using techniques already available in most 
laboratories. DNA purified according to Protocol 
117 can be sequenced using Sequenase 2.0 and 
AmpliTaq. Single primer extension reactions and 
cycle sequencing protocols can be applied. 


Protocol 100 Standard alkaline lysis miniprep of plasmid DNA


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• GET buffer: 50 mM glucose, 25 mM Tris-HCl (pH 8.0), 10 mM EDTA
• lysis buffer: 0.2 M NaOH, 1% SDS
• RNase A (200 µg ml⁻¹)
• potassium acetate (KOAc) (3 M)
• sodium acetate (NaOAc) (3 M, pH 5.2)
• phenol
• chloroform


Method


1 Spin for 30 s to pellet the cells from 1.5 ml overnight culture.
Resuspend cells in 100 µl GET buffer.


2 Add 200 µl freshly prepared lysis buffer, mix gently by inverting a few
times (do not vortex), leave on ice for 5 min.


3 Add 150 µl 3 M KOAc solution, mix gently by inverting, leave on ice
for 5 min. Spin for 5 min to pellet the chromosomal DNA and cell
debris. Transfer 400 µl supernatant to a clean tube.


4 Add 1 ml absolute ethanol and mix by vortexing. Leave at room
temperature for 5 min to precipitate the DNA.


5 Pellet the DNA by centrifugation for 5 min at room temperature.
Decant supernatant and wash pellet once with 1 ml 70% ethanol. Dry
briefly under vacuum.


6 Resuspend the DNA pellet in 100 µl RNase A (200 µg ml⁻¹) and
incubate at 37 °C for 1 h.


7 Extract once with 50 µl phenol followed by 50 µl chloroform. Transfer
aqueous phase to a new tube and precipitate DNA by addition of 1/10
vol. 3 M NaOAc (pH 5.2) and 2 vol. ethanol. Incubate for 5 min at room
temperature.


8 Pellet the plasmid DNA by centrifugation for 15 min at room
temperature. Wash once with 1 ml 70% ethanol, and briefly dry the
DNA under vacuum. Dissolve pellet in 100 µl water.


Protocol 101 PEG precipitation of plasmid DNA


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• PEG solution: PEG 8000 (26.2%), MgCl₂ (6.6 mM), NaOAc (0.6 M,
pH 5.2)


Method


1 Precipitate DNA obtained by Protocol 100 (step 8) with 100 µl PEG
solution. Mix thoroughly and incubate at room temperature for
5 min.


2 Pellet the plasmid DNA by centrifugation for 15 min at room
temperature. Wash once with 1 ml 70% ethanol, and briefly dry the
DNA under vacuum. Dissolve pellet in 100 µl water.


Protocol 102 Short alkaline miniprep for DNA


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• GET buffer (see Protocol 100)
• lysis buffer (see Protocol 100)
• KOAc (3 M)


Method


1 Decant 2 ml overnight culture into a 2-ml snap-cap tube. Spin for 30 s
to pellet the cells and resuspend pellet in 200 µl GET buffer.


2 Add 400 µl freshly prepared lysis buffer, mix gently by inverting a few
times (do not vortex), leave on ice for 5 min.


3 Add 300 µl KOAc solution, mix gently by inverting, leave on ice for
5 min. Spin for 5 min to pellet chromosomal DNA and cell debris.
Transfer 800 µl supernatant to a clean 2-ml snap-cap tube.


4 Add 1 vol. (900 µl) absolute ethanol, mix by vortexing and
immediately pellet the DNA by centrifugation for 5 min at room
temperature. Decant supernatant and wash pellet once with 2 ml
70% ethanol. Dry briefly under vacuum.


5 Resuspend the DNA pellet in 40 µl sterile water. Two microlitres of
the DNA are then used for sequencing. Add sequencing primer, 3 µl
0.4 M NaOH and incubate at 65 °C for 5 min. Neutralize with 3 µl 0.4 M
HCl.


Protocol 103 Standard PEG-phenol method for
recovery of DNA from M13 phage


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• TY medium
• PEG/NaCl solution (20%/2.5 M)
• TE buffer (10 mM)
• NaOAc (3 M, pH 5.2)
• phenol
• ethanol


Method


1 Toothpick a white plaque into 1.5 ml of a 1:100 dilution of an
overnight culture of TG1 cells in TY medium. Shake at 300 r.p.m. for
5-6 h at 37 °C.


2 Transfer the culture to a 1.5-ml microfuge tube and spin for 5 min at
top speed. Transfer supernatant into a fresh 1.5-ml microfuge tube
containing 100 µl 20% PEG/2.5 M NaCl solution without disturbing the
cell pellet. Vortex well and incubate at room temperature for 10 min.


3 Spin for 10 min to pellet the phage particles. Carefully remove all of
the supernatant by aspiration. Spin again for 1 min and aspirate off
supernatant to remove traces of PEG.


4 Resuspend phage pellet in 100 µl 10 mM TE buffer. Add phenol,
vortex thoroughly, and incubate at room temperature for 5 min.
Vortex again and spin for 2 min to separate the phases. Transfer 60 µl
of the aqueous phase, avoiding the interphase, into a fresh 1.5-ml
microfuge tube containing 6 µl 3 M NaOAc (pH 5.2) and 150 µl
ethanol. Vortex thoroughly and precipitate DNA at −20 °C.


5 Pellet the M13 DNA by centrifugation for 10 min at room
temperature. Decant supernatant and wash pellet once with 1.0 ml
70% ethanol. Dry briefly under vacuum.


6 Resuspend DNA in 10-50 µl sterile water to a final concentration of
100 µg ml⁻¹.
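
The precipitation in step 4 (60 volumes of aqueous phase, 6 of 3 M NaOAc and 150 of ethanol) corresponds to 1/10 volume of salt and 2.5 volumes of ethanol. The helper below is a hypothetical convenience for scaling those ratios to another sample volume; its name and defaults are assumptions drawn only from the volumes quoted in this protocol.

```python
def ethanol_precipitation_volumes(sample_ul, salt_fraction=0.1, ethanol_vols=2.5):
    """Return (salt_ul, ethanol_ul) for precipitating a DNA sample.

    salt_fraction: volume of 3 M NaOAc relative to the sample
                   (1/10, as in step 4 of Protocol 103).
    ethanol_vols:  volumes of ethanol relative to the sample
                   (2.5, as in step 4 of Protocol 103).
    Both ratios are taken relative to the sample volume alone, which is
    an assumption of this sketch.
    """
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    salt_ul = sample_ul * salt_fraction
    ethanol_ul = sample_ul * ethanol_vols
    return salt_ul, ethanol_ul
```

Passing `ethanol_vols=2.0` gives the 2 volumes of ethanol used in Protocol 100 instead.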


Protocol 104 Detergent extraction method for
M13 DNA (for 96 samples)


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• TY medium
• PEG/NaCl solution (20%/2.5 M)
• Triton-TE extraction buffer: 0.5% Triton X-100, 10 mM Tris-HCl
(pH 8.0), 1 mM EDTA (pH 8.0), or
• ionic extraction buffer, e.g. Tris-HCl (pH 8.0), 1 mM EDTA, 125 mM KI
(potassium iodide), 0.16 mM KDS (potassium lauryl sulphate)
• 96-tube boxes (Beckman)
• 96-tube cap (Beckman)
• 1.2-ml centrifuge tubes (Beckman) or
• 96 deep-well microtitre plate (Beckman)
• rotor #362349 (Beckman)
• 3M silver tape (R.S. Hughes Company)
• 96-well microtitre plate (Corning)


Method 


1 Use a 96-tube box that holds strips of 12 1.2-ml tubes. A 96 deep-well
microtitre plate can also be used. Add 800 µl of a 1:100 dilution of an
overnight culture of JM101 or TG1 in TY medium to each tube.


2 Clear plaques are picked using toothpicks which are dropped into
single tubes of the 96-tube box. When each tube contains a
toothpick, all are removed and discarded. The box is incubated with
lid taped on securely for 12-16 h at 37 °C and 300 r.p.m.


3 Centrifuge the 96-tube box at 3500 r.p.m. for 15 min using a special
rotor (#362349, Beckman).


4 Prepare a second 96-tube box by placing 120 µl 20% PEG/2.5 M NaCl
solution into each tube using a 12-channel pipetter. Transfer 600 µl of
phage supernatant from the first 96-tube box to the PEG-containing tube
box. A 96-tube cap is placed over all tubes and the box is inverted
several times to mix. Incubate for 15 min at room temperature.


5 Centrifuge the 96-tube box at 3500 r.p.m. for 15 min, and decant the
supernatant by inverting the box over a sink. Leave the box in an
inverted position on a paper towel for 2 min.


6 To completely remove the PEG from the tube walls, a paper towel is
placed underneath the lid of the box and the entire box is
centrifuged in an inverted position at 200-250 r.p.m. for 2 min.

7 Phage pellets are resuspended by adding 20 µl nonionic Triton-TE
extraction buffer to each tube. (In a modified protocol, an ionic
extraction buffer, e.g. Tris-HCl (pH 8.0), 1 mM EDTA, 125 mM KI,
0.16 mM KDS is used.) All tubes are covered with a piece of 3M silver
tape and vortexed on a floor model for 30-45 s with pulsing between
vortex speed levels 2 and 6. Centrifuge briefly to collect phage
solution.


8 Remove bottom of the 96-tube box to check all tubes for complete
resuspension of the phage pellets. Place box in water bath at 80 °C
(for non-ionic extraction buffer) or at 90 °C (ionic extraction buffer)
and incubate for 10 min to achieve phage lysis. Place 96-tube box into
a 4 °C ice slurry for 5 min. Centrifuge briefly to collect condensate.


9 Remove silver tape and transfer resulting solution to 96-well
microtitre plate. Add 20-40 µl water to each well to dilute DNA. Store
at −20 °C.


Protocol 105 Magnetic bead purification of
M13 DNA (for 96 samples) and bead re-use


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• TY medium
• PEG/NaCl (20%/2.5 M)
• 15% SDS
• 0.1× SSC
• streptavidin-coated paramagnetic beads with linked biotinylated
M13-specific oligos (DYNAL)
• 2-ml microfuge tubes (Eppendorf)
• 96-well microtitre plate (Falcon)
• 96-well magnet
• multipipette with 0.5-ml adaptor (Eppendorf)
• 12-channel pipette


Method 


1 Pick 96 white plaques into 2-ml sterile microfuge tubes without lids
containing 1 ml of a 1:100 dilution of an overnight culture of TG1
cells in TY medium. Place tubes in plastic racks (Eppendorf; each
rack holds 10 tubes) and shake for 5-6 h at 37 °C.


2 Prior to sample preparation wash beads as follows: take 4 ml of
paramagnetic beads with M13-specific oligo (10 mg ml⁻¹), wash
beads three times in 4 ml water and resuspend in 1 ml sterile water.
Add 10 µl washed beads to each well of the microtitre plate using a
multipipette with a 0.5-ml adaptor.


3 Spin racks with tubes from step 1 at top speed for 5 min (Eppendorf
centrifuge 5416 or 5413).


4 Transfer 190 µl supernatant to each well of a microtitre plate.


5 Add 10 µl 15% SDS to each sample using a multipipette with a
0.5-ml adaptor and put plate on heat block at 80 °C for 5 min.


6 Add 50 µl PEG/NaCl solution per well and put microtitre plate on
heat block at 45 °C for 20 min.


7 Collect beads by placing microtitre plate on a 96-well magnet.
Aspirate off supernatants from all wells.


8 Wash beads three times with 100 µl 0.1× SSC and mix by pipetting.
Collect beads with magnet. Aspirate off supernatants from all wells.


9 Add 20 µl water to each well, place microtitre plate on heat block at
80 °C for 3 min to release M13 DNA from beads.


10 Collect beads by placing microtitre plate on a 96-well magnet.
Transfer DNA to new microtitre plate using 12-channel pipette.


Note: Paramagnetic beads can be re-used. After washing, old beads
are combined with new beads.


BEADS WORK-UP PROTOCOL 


Additional materials 


• re-use solution: 0.15 M NaOH, 0.001% Tween 20
• 50-ml Falcon tubes
• 10× PBS, 0.01% BSA


Method 
1 Add 50 µl re-use solution to each well of the microtitre plate.


2 Pool beads in 50-ml Falcon tube. Wash beads twice with an equal
volume of re-use solution. Collect beads with magnet.


3 Wash beads once with an equal volume of 10× PBS, 0.01% BSA,
collect beads with magnet and resuspend beads in half the original
volume of 10× PBS, 0.01% BSA.


4 Mix used beads with new beads in a 1:1 ratio.


Protocol 106 Silica gel membrane (or glass filter) purification of
M13 DNA (24 samples)


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• TY medium
• chaotropic salt (NaClO₄)
• TE buffer (0.1 M)
• acetic acid
• 70% ethanol
• 1.5-ml microfuge tubes
• Whatman GF/C filter
• filtration unit

Method 


1 Toothpick a white plaque into 1.5 ml of a 1:100 dilution of an
overnight culture of TG1 cells in TY medium. Shake for 5-6 h at 37 °C.


2 Transfer the culture to a 1.5-ml microfuge tube and spin for 5 min at
top speed. Transfer supernatant into a fresh 1.5-ml microfuge tube
containing 15 µl acetic acid.


3 Put a Whatman GF/C filter into a spin-X filter carrier and place the
spin-X filter carriers into a filtration unit. Add the supernatant from
step 2 onto the Whatman GF/C filter using gentle suction. The
precipitated phages will stick onto the glass filter.


4 Add 1 ml 4 M NaClO₄ in TE to the filter.


5 Wash the filter with 1 ml 70% ethanol and dry the filter for 5 min.


6 Place filter onto a 1.5-ml microfuge tube. Add 20 µl 0.1 M TE to the
filter and spin tube for 30 s in a centrifuge.


Following this protocol 24 samples can be processed within 30 min.
Alternatively, Qiagen offers commercial kits for M13 DNA preparation
based on the same principle in two formats: microspin columns
(QIAprep Spin M13 Kit), and 8-well strips (QIAprep 8 M13 Kit). Up to 48
samples can be processed in parallel in less than 30 min using a vacuum
manifold and a multichannel pipette. A QIAprep 96 M13 Kit is also
available.


Protocol 107 Preparation of phagemid DNA


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• pBluescript (Stratagene)
• phage M13KO7 (or VCSM13, Stratagene)
• TY medium
• kanamycin
• PEG/NaCl solution (20%/2.5 M)
• TE buffer (10 mM)
• NaOAc (3 M, pH 5.2)
• phenol
• ethanol


Method


1 Suspend a fresh bacterial colony containing a phagemid (e.g.
pBluescript) in a sterile 15-ml culture tube with 2-3 ml 2× TY medium
containing the appropriate antibiotic. Add M13KO7 (or its commercial
derivative VCSM13 from Stratagene) to a final concentration of
2 × 10⁷ PFU ml⁻¹. Incubate for 1-1.5 h at 37 °C with strong agitation.


2 Add kanamycin to a final concentration of 70 µg ml⁻¹. Continue
incubation for a further 4-5 h at 37 °C.


3 Prepare single-stranded DNA as described in Protocol 103, from
step 2 onwards.


Protocol 108 CsCl gradient purification of cosmid DNA


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials


• TY medium
• GET buffer (see Protocol 100)
• TE buffer (10 mM)
• 10 mM Tris-HCl (pH 8.0), 0.1 mM EDTA
• 0.2 M NaOH/1% SDS
• 3 M NaOAc (pH 5.2)
• 96% ethanol, 70% ethanol
• CsCl
• ethidium bromide (10 mg ml⁻¹)
• isobutanol
• polyallomer Quick Seal tubes (Beckman)
• 500-ml centrifuge bottles (Sorvall)
• 30-ml centrifuge bottles
• 50-ml centrifuge tubes


Method 


1 Streak cosmid directly from a −70 °C stock to obtain single colonies.


2 Inoculate 4 × 200 ml TY containing the appropriate antibiotic with a
single colony each (avoid very tiny or large colonies). Shake at 37 °C
for 18-24 h.


3 Spin at 10 000 g for 5 min in 500-ml Sorvall centrifuge bottles.
Decant the medium and place the bottles on ice for 2 min. Aspirate
off any remaining medium and add 6 ml GET buffer.


4 Resuspend pellet and transfer with a 10-ml pipette to a 30-ml
centrifuge bottle.


5 Add 8 ml 0.2 M NaOH/1% SDS, mix by inverting and incubate on ice
for 15 min.


6 Add 6 ml 3 M NaOAc (pH 5.2), mix by inverting and incubate on ice
for 30 min.


7 Spin at 17 000 g for 10 min and transfer the supernatant (15-18 ml)
to a 50-ml tube.


8 Add 2 vols 96% ethanol, mix and spin at 900 g for 5 min. Decant the
supernatant and drain.


9 Wash the pellet with 20 ml 70% ethanol, spin at 6000 g for 5 min,
decant the supernatant, drain for 5 min and vacuum dry for 2 h.


10 Add 2.5 ml 10 mM TE buffer and allow DNA to dissolve overnight.


11 Add 2.9 g CsCl and 0.25 ml 10 mg ml⁻¹ ethidium bromide, mix and
spin at 6000 g for 10 min. Transfer the supernatant to polyallomer
Quick Seal tubes. Ensure the tubes are full, then seal and place in a
Beckman TL 100.3 rotor with spacers.


12 Spin at 20 °C at 70 000 r.p.m. for 17 h or at 83 000 r.p.m. for 6 h.


13 Collect the cosmid DNA with a 20-G needle on a 1-ml syringe by
piercing the wall of the tube about 2 mm below the lower
supercoiled cosmid band. Collecting 200 µl yields about 90% of the
cosmid. Do not try to collect more.


14 Add 300 µl water and extract with isobutanol until the organic layer
is colourless.


15 Add water to a final volume of 400 µl and precipitate DNA with
800 µl 96% ethanol.


16 Dissolve cosmid DNA in 20 µl 10 mM Tris-HCl (pH 8.0), 0.1 mM EDTA.


Cosmid clones tend to be unstable; it is therefore recommended to
pick several colonies for independent large-scale growth. The different
preparations should be checked for rearrangements by restriction
enzyme digestion or fingerprinting.
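
The protocol above gives the low-speed spins in g but the ultracentrifuge run in r.p.m. Where a different rotor has to be used, the two can be interconverted with the standard relation RCF = 1.118 × 10⁻⁵ × r × N² (r in cm, N in r.p.m.). The sketch below is a generic helper, not part of the protocol; the function names are hypothetical and the rotor radius must be taken from the rotor manual.

```python
import math

# Standard conversion constant for radius in cm and speed in r.p.m.
RCF_CONSTANT = 1.118e-5


def rpm_to_rcf(rpm, radius_cm):
    """Relative centrifugal force (x g) at a given speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rcf_to_rpm(rcf, radius_cm):
    """Rotor speed (r.p.m.) needed to reach a given RCF at a given radius."""
    if radius_cm <= 0:
        raise ValueError("radius must be positive")
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))
```

For example, `rpm_to_rcf(70000, 4.8)` gives the field at an assumed radius of 4.8 cm; the value 4.8 is a placeholder and should be replaced by the real radius of the rotor in use.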


Generation of single-stranded DNA sequencing 
template by asymmetric PCR (one-step) 


For details of solutions, media and materials, see Appendix |. For 
suppliers and contact addresses see Appendix Ill. 


Materials 


° 10mm Tris-HCl (pH 8.0), 0.1 mu EDTA 

° 1XPCR buffer: 10 mm Tris-HCl (pH 8.3), 50mm KCI, 1.5mm MgCl, 
dNTPs (250 um each) 

universal forward primer 1 (0.5 um) 

universal reverse primer 2 (0.01 um) 

° Taq polymerase (Amplitagq, 1.0 units, Perkin Elmer) 

e 3M NaOAc, pH5.2 

® isopropanol 

¢ 70% ethanol 

e 0.5-ml microfuge tubes 

e PCR tubes 

¢ light mineral oil 

materials and equipment for agarose gel electrophoresis 


Method 


1 Pick a fresh phage plaque with the tip of a pasteur pipette into 100 pl 
of 10 mm Tris-HCl (pH 8.0), 0.1 mm EDTA. 


2 Prepare a mix for 24, 48 or 96 asymmetric PCR reactions (or multiples) 
containing 1 PCR buffer, universal forward primer 1 (0.5 um), 
universal reverse primer 2 (0.01 um), and Taq polymerase (AmpliTaq, 
1.0 units). Dispense 95 pl of the asymmetric PCR mix into 0.5-ml 
microfuge tubes using a multichannel pipette. Transfer 5 ul culture 
(phage stock) into PCR tubes. All reactions are then overlaid with 
100 ul light mineral oil and asymmetric PCR is performed for 35 cycles 
(a typical cycle is 30s at 95 °C, 30s at 50-55 °C and 1-2 min at 72 °C). 


3 A5-ul aliquot of the PCR product is examined by agarose gel 
electrophoresis. Single-stranded DNA runs slower than double- 
stranded DNA and can be visualized by staining with ethidium 
bromide, although the fluorescence is much reduced relative to an 
equivalent amount of double-stranded DNA. 


4 The reaction mixture is carefully removed from under the mineral oil
and transferred to a clean 1.5-ml microfuge tube. 10 µl 3 M NaOAc
(pH 5.2) and 100 µl isopropanol are added, the mixture is vortexed and
the DNA precipitated by incubation at room temperature for 10 min.
DNA is pelleted by centrifugation at 13 000 g for 10 min, washed once
with 400 µl 70% ethanol and briefly dried. The DNA is dissolved in
25 µl water.
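
The mix described in step 2 can be scaled with a short calculation. The
following Python sketch is illustrative only. The final concentrations
and the 95 µl of mix per 100-µl reaction are taken from the method; the
stock concentrations (10× buffer, 10 µM primer stocks, 5 units/µl
enzyme) and the 10% pipetting overage are assumptions and should be
replaced with the values of the reagents actually in use.

```python
def master_mix(n_reactions, overage=0.10):
    """Return component volumes (in µl) for an asymmetric PCR master mix.

    Concentrations are treated as final concentrations in the 100-µl
    reaction (95 µl mix + 5 µl phage stock). Stock concentrations below
    are assumed for illustration, not taken from the protocol.
    """
    reaction_vol = 100.0  # µl per reaction
    mix_vol = 95.0        # µl of mix dispensed per tube

    # name: (assumed stock concentration, final concentration), same unit
    components = {
        "10x PCR buffer": (10.0, 1.0),     # x-fold
        "forward primer 1": (10.0, 0.5),   # µM
        "reverse primer 2": (10.0, 0.01),  # µM (limiting primer)
    }
    taq_units_per_reaction = 1.0
    taq_stock_units_per_ul = 5.0           # assumed stock activity

    per_reaction = {
        name: reaction_vol * final / stock
        for name, (stock, final) in components.items()
    }
    per_reaction["Taq polymerase"] = taq_units_per_reaction / taq_stock_units_per_ul
    # Water makes the mix up to the dispensed volume.
    per_reaction["water"] = mix_vol - sum(per_reaction.values())

    scale = n_reactions * (1.0 + overage)
    return {name: round(vol * scale, 2) for name, vol in per_reaction.items()}


if __name__ == "__main__":
    for name, vol in master_mix(96).items():
        print(f"{name:18s} {vol:9.2f} µl")
```

The 0.5 µM : 0.01 µM primer concentrations correspond to a 50:1 molar
excess of the forward primer, which is what drives the accumulation of
single-stranded product once the limiting primer is exhausted.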


Protocol 110 Generation of single-stranded DNA sequencing 
template by asymmetric PCR (two-step) 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• as Protocol 109


Method 
1 Perform symmetric PCR as described in steps 1-2 of Protocol 111. 


2 A 5-µl sample of the reaction mixture is removed from under the
mineral oil and transferred to a new tube (or microtitre plate)
containing 45 µl water.


3 A linear amplification mix for 24, 48 or 96 PCR reactions is prepared
containing 1× PCR buffer, one universal primer (0.5 µM) and Taq
polymerase (AmpliTaq, 0.5 units). Dispense 50-100 µl of the PCR mix
into 0.5-ml microfuge tubes using a multichannel pipette. Transfer
1-2 µl diluted double-stranded PCR product from step 2 into each new
tube. All reactions are then overlaid with 50-100 µl light mineral oil
and linear amplification is performed for 35 cycles (a typical cycle is
30 s at 95 °C, 30 s at 50-55 °C and 1-2 min at 72 °C).


4 A 5-µl aliquot of the PCR product is examined by agarose gel
electrophoresis.


5 The reaction mixture is carefully removed from under the mineral oil
and transferred to a clean 1.5-ml microfuge tube. 10 µl 3 M NaOAc
(pH 5.2) and 100 µl isopropanol are added, the mixture is vortexed, and
the DNA precipitated by incubation at room temperature for 10 min.
DNA is pelleted by centrifugation at 13 000 g for 10 min, washed once
with 400 µl 70% ethanol and briefly dried. The DNA is dissolved in
25 µl water.


Generation of double-stranded DNA sequencing
template by symmetric PCR (for 24, 48 or 96
templates)


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• TB broth or 2× TY broth
• PCR buffer (see Protocol 109)
• dNTPs (250 µM each)
• universal primers (0.3 µM each)
• Taq polymerase (AmpliTaq, Perkin Elmer)
• microtitre plate (Corning)
• 0.25-ml microfuge tubes or heat-stable microtitre plate (Techne)
• 96-pin hedgehog device


Method 


1 Recombinant colonies are toothpicked into separate wells of a
microtitre plate containing 100 µl of TB or 2× TY broth with the
appropriate antibiotic. The plates are incubated with lids on at 37 °C
for 12-24 h without shaking. Culture microtitre plates can be stored at
4 °C for several weeks until sequencing is finished. Replica plates
containing glycerol are stored at -70 °C.


2 Prepare a mix for 24, 48 or 96 PCR reactions (or multiples of this)
containing 1× PCR buffer, universal primers (0.3 µM each) and Taq
polymerase (0.5 units per well). Dispense 20-30 µl of the PCR mix into
0.25-ml microfuge tubes or into wells of a heat-stable microtitre
plate using a multichannel pipette. Transfer a small amount of
culture (0.5 µl) into the PCR tubes. A 96-pin hedgehog device is used
for simultaneous transfer of many samples from the culture plate to the
PCR plate. All reactions are then overlaid with 20 µl of light mineral
oil and PCR is performed in a thermal cycler (e.g. MW-1, PHC-3,
Techne). After an initial denaturation period at 95 °C for 150 s, in
order to free some template DNA, 35 cycles are carried out including
denaturing at 95 °C for 30 s, annealing at 50-55 °C for 30 s and
extension at 72 °C for 1-2 min (depending on insert size).


3 A 5-µl aliquot of the PCR product is examined by agarose gel
electrophoresis to estimate insert size and check purity of PCR
product.


Protocol 113


Purification of PCR-generated double-stranded DNA 
sequencing template by selective polyethylene glycol 
(PEG) precipitation 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• PEG mix: 26.2% PEG 8000, 6.6 mM MgCl2, 0.6 M NaOAc (pH 5.2)
• ethanol


Method 


1 A large portion of the aqueous phase of the PCR product from step 2
of Protocol 111 is transferred to 0.5-ml microfuge tubes containing
an equal volume of PEG mix. It is mixed thoroughly and then
incubated at room temperature for 5 min. The PEG mix is dispensed in
advance using a multiple pipetter, e.g. Eppendorf 4780.


2 Spin the samples at 13 000 g for 5 min and remove supernatant
carefully with a yellow tip, avoiding the usually invisible DNA pellet.


3 Wash pellets once with ethanol, dry tubes and redissolve DNA in
water.


Generation of single-stranded DNA sequencing
template by PCR followed by affinity capture
using the biotin-streptavidin system


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• streptavidin-coated magnetic beads (Dynabeads M-280, Dynal)
• materials for PCR (see Protocol 111)
• biotinylated universal primer 1
• unbiotinylated universal primer 2
• 0.15 M NaOH


Method 


1 PCR amplification is performed under the usual conditions (see
Protocol 111) but using 0.3 µM biotinylated universal primer 1 and
0.3 µM unbiotinylated universal primer 2.


2 For each amplified template, 30 µl of washed streptavidin-coated
magnetic beads are added directly to the reaction tube or microtitre
well, including the oil overlay. The biotinylated PCR product is
allowed to bind to the beads by incubating at room temperature for
10 min.


3 Beads are pelleted towards the side of the tube or well using a
suitable magnet and the supernatant is removed.


4 Denature captured DNA by adding 20 µl 0.15 M NaOH and incubate
the beads for 5 min at room temperature. Beads are pelleted again
and the supernatant is removed.


5 Wash the beads once with 20 µl 0.15 M NaOH, followed by three
washes with 40 µl water. During each wash, pellet the beads with the
magnet and remove the supernatant.


6 Beads are resuspended in water and used for the sequencing reaction.


Recovery of PCR product by the 
‘freeze and squeeze’ method 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• Millipore ULTRAFREE-MC 0.45 µm microfuge cartridge
• 3 M NaOAc, pH 5.2
• absolute ethanol, 70% ethanol


Method 
1 Excise DNA band of interest with a clean razor from agarose gel. 


2 Stuff the slice into the filter unit of Millipore ULTRAFREE-MC 
microfuge cartridge. 


3 To freeze the agarose slice, put the assembled cartridge (filter unit
with slice in the provided microfuge tube) at -70 °C for 15 min or at
-20 °C for 30 min.


4 Immediately (while the agarose is still frozen) spin at 13 000 g for
5 min.


5 Discard the filter unit with the dry powdered agarose. To the
aqueous filtrate add 0.1 vol. 3 M NaOAc (pH 5.2) and 3 vols cold
absolute ethanol to precipitate the DNA. Incubate at room temperature
for 5-10 min, spin at 13 000 g for 5 min, discard supernatant, wash
once with 70% ethanol, air dry and dissolve DNA in an appropriate
volume of water.
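
The volumes needed in step 5 follow directly from the volume of
filtrate recovered. A minimal Python sketch is given below; it assumes
that both the 0.1 vol. of NaOAc and the 3 vols of ethanol are reckoned
against the original filtrate volume, which is one common reading of
such instructions but is not stated explicitly in the method.

```python
def ethanol_precipitation_volumes(filtrate_ul):
    """Return (NaOAc µl, ethanol µl, total µl) for a given filtrate volume.

    Assumption: both additions are calculated from the starting
    filtrate volume, not from the volume after adding salt.
    """
    if filtrate_ul <= 0:
        raise ValueError("filtrate volume must be positive")
    naoac_ul = 0.1 * filtrate_ul
    ethanol_ul = 3.0 * filtrate_ul
    total_ul = filtrate_ul + naoac_ul + ethanol_ul
    return naoac_ul, ethanol_ul, total_ul


if __name__ == "__main__":
    naoac, ethanol, total = ethanol_precipitation_volumes(200.0)
    print(f"NaOAc {naoac:.0f} µl, ethanol {ethanol:.0f} µl, total {total:.0f} µl")
```

Checking the total against the capacity of the tube in use before
adding ethanol avoids having to split the sample afterwards.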


A cheaper way of doing this is to punch a small hole at the bottom of a
0.5-ml tube and to stuff the bottom of the tube with siliconized glass
wool to hold back the agarose. This can then be inserted into a 1.5-ml
tube and be used instead of the filter unit.

The quality of DNA obtained by the freeze and squeeze method can
be improved by subjecting the filtrate to a phenol/chloroform
extraction prior to ethanol precipitation or by doing a PEG precipitation
(see Protocol 101) instead of the ethanol precipitation.


Protocol 115 Recovery of PCR product by column purification


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• low-melting-point (LMP) agarose gel (SeaPlaque or NuSieve GTG
agarose, FMC BioProducts)
• ethidium bromide
• Magic PCR Preps resin (Promega)
• 80% isopropanol
• equipment for electrophoresis
• 1.5-ml microfuge tubes
• Magic minicolumn (Promega)


Method 


1 Separate the PCR reaction products by electrophoresis in an LMP 
agarose gel containing ethidium bromide using standard 
procedures. 


2 Visualize band under long wavelength UV light and excise the 
desired DNA band using a clean razor blade. 


3 Transfer agarose slice to a 1.5-ml microfuge tube and incubate 
sample at 70 °C until agarose is completely molten. 


4 Add 1 ml Magic PCR Preps resin to the molten agarose slice and 
vortex for 20s, then leave on bench for 5 min for DNA to bind to the 
resin. 


5 For each PCR product, prepare one Magic Minicolumn. Remove and 
set aside the plunger from a 3-ml disposable syringe. Attach the 
syringe barrel to the luer-lock extension of each minicolumn. 


6 Pipette the resin/DNA mix from step 4 into the syringe barrel. Insert 
the syringe plunger slowly, and gently push the slurry into the 
minicolumn with the plunger. 


7 Detach syringe from minicolumn, and remove the plunger from the
syringe. Re-attach the syringe barrel to the minicolumn. Pipette 2 ml
80% isopropanol into the syringe to wash the column. Insert the
plunger into the syringe and gently push the isopropanol through
the column.


8 Remove the syringe and transfer minicolumn to a 1.5-ml 
microcentrifuge tube. Centrifuge the minicolumn for 20s at 
13 000 g to dry the resin. 


9 Transfer minicolumn into new microfuge tube. Apply 50 ul water to 
the column and wait 3-5 min. Centrifuge column for 20s at 13 000g 
to elute the bound DNA. 


10 Repeat step 9 with another 50 ul water. 


11 The amount recovered is estimated by running a portion of each
fragment on an agarose gel. Approximately 30 ng double-stranded DNA of
about 250-400 bp is used per ‘fmole’ cycle sequencing reaction and
500 ng for cycle sequencing with dye terminators.


The use of a vacuum manifold allows for the processing of 20 
fragments at a time and the whole procedure takes less than 1h. 
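
Because template amounts for cycle sequencing are often quoted in molar
terms, it can help to convert the mass estimated from the gel in step 11
into femtomoles. The sketch below is a rough guide, not part of the
protocol; it assumes the commonly used average of 660 g/mol per base
pair of double-stranded DNA.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (assumed average)


def dsdna_ng_to_fmol(mass_ng, length_bp):
    """Convert a mass of double-stranded DNA (ng) to femtomoles."""
    if mass_ng < 0 or length_bp <= 0:
        raise ValueError("mass must be >= 0 and length must be > 0")
    # ng -> g is 1e-9 and mol -> fmol is 1e15, giving a factor of 1e6.
    return mass_ng * 1e6 / (length_bp * AVG_BP_MASS)


if __name__ == "__main__":
    for length in (250, 300, 400):
        print(f"30 ng of a {length}-bp fragment = "
              f"{dsdna_ng_to_fmol(30, length):.0f} fmol")
```

Under this assumption, 30 ng of a 250-400 bp fragment corresponds to
roughly 110-180 fmol of template.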


Recovery of PCR product by the agarose method 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• 10 mM Tris-HCl (pH 7.6), 5 mM EDTA (pH 8.0), 0.1 M NaCl
• GELase (Cambio Ltd)
• ethanol
• 3 M NaOAc


Method 


1 Incubate a gel segment of low-melting-point (LMP) agarose
containing the DNA of interest at room temperature for 30 min in 20
vols 10 mM Tris-HCl (pH 7.6), 5 mM EDTA (pH 8.0) and 0.1 M NaCl.


2 Remove excess buffer carefully, transfer the gel segment to a clean 
tube and incubate at 70 °C until gel is completely molten. 


3 Equilibrate the molten gel carefully to 45 °C. Centrifuge tube briefly 
to collect all the material in the bottom of the tube. 


4 Add 1 unit GELase per 600 mg of 1% agarose gel and incubate at
45 °C for 1 h. During this time the agarose is digested to
oligosaccharides.


5 DNA is further purified by ethanol precipitation in the presence of
3 M NaOAc, washed once with 70% ethanol, dried and resuspended
in a suitable volume of water for sequencing.


The LMP agarose gel must be completely molten for the agarose 
bonds to be accessible to GELase. 

GELase is rapidly inactivated at temperatures above 45 °C, while LMP 
agarose begins to resolidify below 45 °C. 


Direct sequencing of DNA in 
low-melting-point agarose 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Materials 


• low-melting-point (LMP) agarose (SeaPlaque or NuSieve GTG
agarose, FMC BioProducts)
• 1× TBE/0.5 mg ml⁻¹ EDTA


Method
1 Prepare LMP agarose gels in 1× TBE containing 0.5 mg ml⁻¹ EDTA.


2 Mix 20 µl aliquots of the PCR reaction mixture with 4 µl of loading
dye. Load samples in individual lanes of the gel. Run gels long
enough to yield well-resolved bands. Excise bands using a razor
under long-wavelength UV light. It is important to trim away all
excess agarose from the product band.


3 Melt the DNA-containing gel slice by heating for 5 min at 68 °C. Use
10 µl DNA for sequencing.


Agarose gels of 0.5-2.0% are compatible with the method. NuSieve
agarose should be used for separation of fragments smaller than 1 kb
and SeaPlaque for bands greater than 1 kb.


References

1 Birnboim, H.C. & Doly, J. (1979) A rapid alkaline
extraction procedure for screening recombinant plasmid
DNA. Nucleic Acids Res. 7, 1513-1523.
2 Sambrook, J., Fritsch, E.F. & Maniatis, T. (1989)
Small-scale preparations of plasmid DNA. In Molecular
Cloning: A Laboratory Manual, 1.25-1.39 (2nd edn, Cold
Spring Harbor Laboratory Press, Cold Spring Harbor,
NY).


22.1 Introduction


Determination of the sequence of DNA is one of the 
most important aspects of modern molecular bio- 
logy. Since the development in the late 1970s of the 
currently used DNA sequencing techniques, di- 
deoxy chain termination [1] and chemical degra- 
dation [2], newer and faster methods have been 
devised and developed. Now, over 150 million bases 
are available, stored in international data banks such 
as EMBL, GenBank, and the DDBJ, making them 
fully accessible to the scientific community. Many 
viral genomes have been completely sequenced, 
including the genome of cytomegalovirus, which 
consists of 200 000 bp [3]. The complete sequences of 
the bacterium Escherichia coli (see Chapter 31), and 
the yeast Saccharomyces cerevisiae (see refs 3-5 and 
Chapter 30) are now available. Projects are also 
under way to determine the DNA sequence of the 
fruitfly (Drosophila melanogaster) (Chapter 28), the 
mouse (Mus musculus) (Chapter 26), the nematode 
worm (Caenorhabditis elegans) (Chapter 29), a plant 
(Arabidopsis thaliana) (Chapter 33), and the human 
(Homo sapiens) genome. 

The two original methods of DNA sequencing 
that were described in 1977 differ considerably in 
principle. The enzymatic (or dideoxy chain termination)
method of Sanger et al. [1] involves the
synthesis of a DNA strand from single-stranded 
template by a DNA polymerase. The Maxam and 
Gilbert (or chemical degradation) method [2] 
involves chemical degradation of the original DNA. 
Both methods produce populations of radioactively 
labelled polynucleotides that begin from a fixed 
point and terminate at points dependent on the 
location of a particular base in the original strand. 
The fragment sets of the four reactions are loaded on 
adjacent lanes on a polyacrylamide slab gel and 
resolved by polyacrylamide gel electrophoresis. 
Autoradiographic imaging of the pattern of the
labelled DNA bands in the gel reveals the relative
sizes, corresponding to band mobilities, of the
fragments in each lane, and the DNA sequence is
deduced from this pattern [6]. Using high-resolution
denaturing polyacrylamide gel electrophoresis,
one can resolve single-stranded
oligodeoxynucleotides up to 700 bases long.
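
The way a sequence is deduced from the band pattern can be
illustrated with a small Python sketch. It is an idealized model,
assuming four base-specific lanes in which every fragment length
appears in exactly one lane (as in dideoxy sequencing); the
function name and the example band positions are invented for
illustration.

```python
def read_ladder(lanes):
    """Read a sequence from four base-specific lanes.

    `lanes` maps a base ('A', 'C', 'G' or 'T') to the lengths of the
    labelled fragments seen in that lane. Shorter fragments migrate
    further, so reading from the bottom of the gel upwards is the
    same as reading in order of increasing fragment length.
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            if length in base_at_length:
                raise ValueError(f"ambiguous band at length {length}")
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


if __name__ == "__main__":
    gel = {"A": [1, 5], "C": [2], "G": [3, 6], "T": [4]}
    print(read_ladder(gel))  # prints ACGTAG
```

Real gels are read relative to a fixed end point (the primer or
the labelled end), and bands may be missing or doubled; the sketch
reports only the second of these as an error.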

Although both these techniques are still employed 
today, there have been many modifications and 
improvements to the original protocols. The chemi- 
cal degradation method is still in use, but the 
enzymatic chain termination method is by far the 
most established and widely used technique for 
sequence determination. Fundamental to dideoxy 
sequencing are three major tools: 


• DNA polymerases;
• deoxynucleotide analogues;
• polyacrylamide gel electrophoresis (PAGE) [7].

The most common enzymes currently in use are
the Klenow fragment of DNA polymerase I [1],
modified T7 DNA polymerase [8], and the thermophilic
DNA polymerase (Taq polymerase) of
Thermus aquaticus [9]. The traditionally used Klenow
enzyme has been increasingly replaced by the T7
and Taq DNA polymerases. The basic chemistry of
deoxynucleotides has remained unchanged, but 
new analogues, including deoxyinosine (dITP) [10] 
and the 7-deaza analogues, c7dATP and c7dGTP 
[11], have been synthesized for special purposes 
in sequencing. There have been fewer changes in 
PAGE technique compared to the DNA poly- 
merases; however, the introduction of thin gels [12], 
wedge-shaped gels [13], and buffer gradient gels 
[14] has significantly improved the data obtainable 
from PAGE both in quality and quantity. 

The largest number of bases resolved from a gel 
run of a single sequencing reaction is around 
500-750. For longer stretches to be analysed, addi- 
tional strategies have to be pursued. For example, 
‘walking’ primers are used for sequence lengths 
between 500 and 2000 bp. Between 2000 and 5000 bp,
an ordered deletion method such as the Bal31
exonuclease approach [15] or the unidirectional
exonuclease III method [16] is employed. Both methods
involve a controlled progressive degradation of 
parts of the DNA insert from a fixed point. For DNA 
fragments longer than 5000 bp, the random shotgun 
approach is used [17-19]. Using this approach, the 
DNA insert is enzymatically or sonically cut at non- 
specific points into smaller fragments and sub- 
cloned into a plasmid vector suitable for performing 
sequencing reactions. The data obtained from the 
individual sequencing reactions are compiled with 
the aid of a computer, resulting in the final DNA 
sequence. 
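
The computer-aided compilation step of the shotgun approach can
be sketched as a greedy merge of overlapping reads. The Python
example below is a toy illustration only: it assumes error-free
reads that all come from the same strand, and the function names,
the minimum overlap and the example reads are invented. It is not
the algorithm of any particular assembly program.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are wholly contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: several contigs are left
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

With real data, reads from both strands, sequencing errors and
repeated sequences all have to be handled, which is why dedicated
assembly software is used in practice.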

Another point to be addressed in sequencing is the
type of DNA being used as template (see Chapter
21). Several different sources of DNA are in
use. Both the single-stranded M13 phage [20] and
phagemids, which are plasmids containing the
intergenic region of M13 and can be packaged in
single-stranded form using helper phage [21], are
widely used. Double-stranded DNA from circular
plasmids, cosmids, or linear phage lambda DNA can
also be used [13]. Finally, the products of the polymerase
chain reaction (PCR) are routinely used as DNA
templates for sequencing [22].

To automate the process, the use of hazardous
and expensive radioisotopes has been eliminated
and replaced by fluorescence methods; a chemistry
has been devised for attaching DNA molecules
to fluorophores, which can be detected by laser
technology.

Additional recent innovations include the use of 
PCR technology to enable the sequencing reaction to 
be used repetitively to generate a sequencing ladder 
[23,24]. A recently developed detection method that 
is comparable in sensitivity to traditional radiola- 
belling is sequencing by chemiluminescence, in 
which detection of the sequencing products occurs 
by a chemiluminescent reaction that can be moni- 
tored by autoradiography [25,26]. 

Another innovative approach is multiplex 
sequencing, which uses hybridization to a specific 
probe to detect an individual sequencing ladder in a 
mixture of ladders (see Chapter 20). In this method, 
DNA samples are mixed and sequenced using either 
the enzymatic or the chemical method, and the 
products are fractionated on a sequencing gel which 
is transferred to a membrane, and hybridized with a 
probe specific for one template, a process which can 
be repeated up to 40 times employing different 
probes. Thus, the amount of sequence information 
available from one gel can be multiplied by the 
number of times the membrane can be rehybridized 
[27]. Multiplex sequencing originally used radio- 
active probes and chemical sequencing technology 
[27], but can be extended beyond the Church 
approach to include chemiluminescent label instead 
of a radioisotope [25] and to use Sanger dideoxy 
chemistry instead of Maxam-Gilbert [28]. 

Another recent innovation that is applicable to 
both manual and automated DNA sequencing is 
solid-phase DNA sequencing [29,30]. In this
procedure one strand of a double-stranded DNA
molecule is biotinylated. The hemibiotinylated
DNA molecule is then bound to streptavidin-coated
ferromagnetic beads. Sequencing reactions can be
performed using the denatured biotinylated strand 
preparation as the template. 

Commercial efforts are being made to automate 
parts of the sequencing process, from sample 
preparation through DNA analysis, by using robotic 
work stations [31,32]. All commercially available 
automated sequencers are designed for enzymatic 
sequencing reactions with manual gel preparation 
and sample loading, but with automatically con- 
trolled electrophoresis and data analysis. 

In future, employing mass spectrometry or tun- 
nelling electron microscopy for DNA sequencing 
might dramatically increase the rate at which DNA 
can be sequenced. 


22.2 Chemical DNA sequencing 
(Maxam-Gilbert method) 


22.2.1 Overview 


In the chemical method of DNA sequencing devel- 
oped by Maxam and Gilbert [2], the target DNA is 
radioactively labelled at one end (3’- or 5’-end). This 
label is the reference point for determining the 
positions of the nitrogenous bases. In four separate 
reactions, the labelled DNA is cut with a base- 
specific chemical reagent under limiting conditions 
and the reaction products are separated on a se- 
quencing gel. Because only end-labelled fragments 
are observed following autoradiography of the 
sequencing gel, the DNA sequence can be read from 
the four DNA ladders. The sequencing reaction 
consists of two stages: 

1 the chemical modification step, which is carried 
out in such a way that only one base of one type, 
such as guanine, is modified once in every 500-1000 
bases; 

2 chain cleavage at the modification sites, which is 
taken to completion. 
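
The relationship between base-specific cleavage and the resulting
ladders can be illustrated with a short Python sketch. It assumes
the classic set of four reactions (G, A+G, C+T and C), which the
text above does not spell out, and an idealized experiment in
which each labelled molecule is cut once and every position is
represented. The function names and the example sequence are
invented for illustration.

```python
# Bases attacked in each of the four reactions (classic set, assumed).
REACTIONS = {"G": "G", "A+G": "AG", "C+T": "CT", "C": "C"}


def cleavage_ladders(seq):
    """Return, for each reaction, the lengths of the 5'-end-labelled fragments.

    Cleavage destroys the modified base, so a cut at the base with
    0-based index i leaves a labelled fragment of i nucleotides.
    Index 0 would leave no fragment and is therefore skipped.
    """
    return {
        name: [i for i, base in enumerate(seq) if i > 0 and base in targets]
        for name, targets in REACTIONS.items()
    }


def read_ladders(ladders):
    """Deduce the sequence (from the second base onwards) from the four lanes."""
    g, ag, ct, c = (set(ladders[k]) for k in ("G", "A+G", "C+T", "C"))
    calls = []
    for length in sorted(ag | ct):
        if length in g:
            calls.append("G")   # band in both G and A+G lanes
        elif length in ag:
            calls.append("A")   # band in A+G lane only
        elif length in c:
            calls.append("C")   # band in both C and C+T lanes
        else:
            calls.append("T")   # band in C+T lane only
    return "".join(calls)


if __name__ == "__main__":
    ladders = cleavage_ladders("TGCATGGACT")
    print(ladders)
    print(read_ladders(ladders))
```

In this model the base at the labelled end itself cannot be read,
because cleavage there leaves no labelled fragment on the gel.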

The modification reactions use very toxic and 
fairly unstable reagents (dimethyl sulphate, hydra- 
zine, potassium permanganate, etc.), which is one of 
the reasons for the rapid decrease in the popularity 
of this method. Another reason is the development 
of simple and improved methods for enzymatic 
DNA sequencing based on the Sanger method [1]. 

Although the chemical procedure for DNA 
sequencing is not as widely used as the Sanger 
method [1], it has some advantages and can be very 
useful in certain situations. Sequencing can be 
performed from any point in the clone where a 
suitable restriction site occurs, obviating the need 
for further subcloning. The sequence thus obtained 
can be used to design oligonucleotide primers, and 
further sequence can then be obtained by Sanger 
sequencing. The Maxam-Gilbert method is also 
very useful for resolving regions of DNA that yield 
poor results in Sanger sequencing owing to second- 
ary structures in the DNA [32]. 


22.2.2 Chemical sequencing in practice 


Chemical sequencing requires a DNA fragment that
is labelled at only one end. A prerequisite, and a
disadvantage, of chemical DNA sequencing is the
need for a detailed restriction map of the fragment
to be sequenced. This knowledge permits the generation
of a series of subfragments that can be enzymatically
labelled at both ends. These labelled fragments then
have to be cut asymmetrically to produce two fragments,
each labelled at only one end, which can be purified
by gel electrophoresis. With the aid of specially
constructed vectors [33,34], however, the requirement
for purifying and labelling individual restriction
fragments can be circumvented, which greatly facilitates
chemical sequencing projects. These vectors enable
DNA that is labelled at only one end to be sequenced.
This strategy is analogous to the universal
priming site in enzymatic dideoxy sequencing.
These vectors allow the cloning of a set of nested
deletions, generated, for example, by the unidirectional
Bal31 deletion strategy. This nested deletion
strategy is recommended for large chemical
sequencing projects [33].


22.2.3 Vectors for chemical sequencing 


Recently, the construction of specialized vectors for 
rapid and simultaneous sequence analysis of a large 
number of samples by chemical sequencing has 
been described [33,34]. These vectors, pSP64CS and 
pSP65CS, are high-copy-number plasmids with a 
synthetic polylinker containing two Tth111I sites 
flanking a SmaI site. Tth111I cleaves at
the sequence GACNNNGTC, leaving a single
protruding 5’ base. The way the two Tth111I sites are
designed makes it possible to selectively label one
end of a Tth111I fragment using the Klenow
fragment of the DNA polymerase I in an end- 
labelling reaction. The labelled fragment can be 
sequenced directly, without prior gel purification. 
The use of these vectors has several advantages. 
First, they allow for rapid and simultaneous 
sequencing of a large number of samples by the 
chemical cleavage method. Second, the enzymatic
end-labelling step is simple and easy to perform.
Third, owing to the high copy number of these
plasmids, a small plasmid preparation yields 
enough DNA to repeat the sequencing reactions 
several times, once the DNA is radiolabelled. 
Fourth, the vectors are versatile, and allow for the 
sequencing of a large number of fragments bordered 
by any of the unique restriction sites in the 
polylinker. Fifth, each plasmid contains the SP6
promoter element, which can be utilized to synthesize
RNA complementary to the cloned DNA.
This feature can be useful in exon/intron mapping
or other types of nuclease protection/mapping
experiments, which can be carried out simultaneously
with sequencing using the same plasmid
[33,34].


22.2.4 Direct DNA sequencing of PCR products 


PCR amplification has enormously simplified and
accelerated the analysis of DNA that is only
available in small amounts. For example, mutations 
in genes of patients suffering from genetic diseases 
can be diagnosed routinely by PCR. Various 
methods of detecting mutations in PCR-amplified 
products have been described, including allele- 
specific oligonucleotide hybridization [36], restric- 
tion enzyme digestion, sequencing of subcloned 
PCR products, transcript sequencing of amplified 
DNA [37], and direct DNA sequencing using single- 
stranded or double-stranded PCR products employ- 
ing the chain termination method of Sanger et al. [1]. 
However, direct sequencing of single- or double-stranded
DNA by the enzymatic approach often gives poor
resolution, especially if shorter DNA templates are
used.

During their study of mutations in patients with 
propionic acidemia, Kraus and Tahara [35] developed
a protocol that allows the direct sequencing of
PCR-amplified genomic DNA by the Maxam-Gilbert
method. This procedure yields clearly readable
DNA sequences, 100-400 bp in length, derived
from human genomic DNA, in 4 days. Essentially,
the procedure can be divided into four steps: 

1 the radiolabelling of primers with [γ-32P]ATP prior
to amplification;

2 PCR, which generates radiolabelled amplified 
DNA; 

3 removal of excess primers by spin dialysis; 

4 chemical cleavage of the PCR product with some 
modifications. 


22.2.5 Outlook 


The original method of chemical DNA sequencing 
[2] has been modified and improved over the years 
[38]. Additional chemical cleavage reactions have 
been devised [39], new end-labelling techniques 
developed [34], and shorter, simplified protocols 
have been described [40,41]. The main advantage of 
chemical degradation sequencing is that the 
sequence is obtained from the original DNA mole- 
cule and not from an enzymatic copy, in which 
wrong bases could have been incorporated. There- 
fore, with this method it is possible to analyse DNA 
modifications such as methylation, and to study 
protein-DNA interactions. When strong secondary
structures are encountered that cannot be
resolved by enzymatic sequencing, the chemical
approach is the method of choice. There are
several advantages of the Maxam-Gilbert sequenc- 
ing method for analysis of PCR products [38]. The 
first is its consistency, yielding clear DNA sequences 
of fragments 100-400 bp long, whereas either single 
or double-stranded sequencing by the chain termination
methods often yields unreadable sequences.
Second, it allows examination of all sequences 
present in the PCR-amplified products. Finally, 
because the sequence starts with the first (5’) 
nucleotide of the primer, there is no loss of sequence 
information immediately adjacent to the primer and 
no modifications to the sequencing protocol are 
required to obtain these data. 


22.3 Enzymatic DNA sequencing 
(Sanger method) 


22.3.1 Overview 


In the conventional dideoxy sequencing reaction, an 
oligonucleotide primer is annealed to a single- 
stranded DNA template and extended by Escherichia 
coli DNA polymerase I to synthesize a complemen- 
tary copy of a single-stranded DNA in the presence 
of four deoxyribonucleoside triphosphates (dNTPs), 
one of which is 35S-labelled. DNA polymerases are 
not able to initiate DNA chains. Therefore, chain 
elongation occurs at the 3’-end of a short com- 
plementary primer which is annealed adjacent to the 
DNA segment to be sequenced. Chain growth 
involves the formation of a phosphodiester bridge 
between the 3’-hydroxyl group at the growing end 
of the primer and the 5’-phosphate group of the 
incorporated deoxynucleotide. Thus, overall chain 
growth is in the 5’→3’ direction. The reaction
mixture also contains one of four dideoxyribonucleoside
triphosphates (ddNTPs) that terminate elongation 
when incorporated into the growing DNA chain. 
The enzymatic sequencing method is based on the 
ability of DNA polymerases to use both 2’- 
deoxynucleotides and 2’,3’-dideoxynucleotides as 
substrates. When a dideoxynucleotide is incorpo- 
rated at the 3’-end of the growing primer chain, the 
elongation is terminated selectively at A, C, G or T 
owing to the missing 3’-hydroxyl group of the 
primer chain. After completion of the sequencing 
reactions, the products are subjected to electro- 
phoresis on a high-resolution denaturing poly- 
acrylamide gel and then autoradiographed to 
visualize the DNA sequence. 

Since intact E. coli DNA polymerase I also has 5’→3’
exonuclease activity, the large fragment (Klenow
fragment) of E. coli DNA polymerase I, which can 
still carry out the elongation reaction, has histori- 
cally been used. Alternatively, reverse transcriptase 
(either from Moloney murine leukaemia virus or 
from avian myeloblastosis virus) can also be 
employed. In addition, the use of T7 bacteriophage 
DNA polymerase (Sequenase) [8] has improved and 
simplified DNA sequence analysis with respect to both
quality of resolution and quantity of data obtainable.
The use of Taq DNA polymerase from the
thermophilic bacterium T. aquaticus, which has a 
temperature optimum for polymerization of 75- 
80°C, is particularly advantageous for sequencing 
DNA templates that exhibit strong secondary 
structures at lower temperatures [42], since a higher 
reaction temperature will disrupt DNA secondary 
structures and inhibit reannealing of denatured, 
double-stranded templates. 

Another variation to conventional DNA sequenc- 
ing involves substitution of dITP or 7-deaza-dGTP 
for dGTP in the nucleotide mixes to destabilize 
secondary structures that can otherwise form in the 
sequencing products during electrophoresis to 
cause gel ‘compressions’. In addition, manganese 
can be substituted for magnesium in the
labelling/termination reaction. Manganese increases the band
uniformity exhibited by Sequenase and can increase 
the intensity of the sequencing ladder near the 
primer. 

In practice, enzymatic DNA sequencing involves 
the following steps. 

1 The DNA to be sequenced is prepared as single- 
stranded molecules. 

2 A short, chemically synthesized oligonucleotide 
primer is annealed to the 3’-end of the region to be 
sequenced. The annealed oligonucleotide serves as 
primer for DNA polymerase. 

3 The hybrid molecules are divided into four 
aliquots. Each contains all four dNTPs, one of which 
is 35S-labelled, and also contains one of the four 
2’,3’-ddNTPs. The dNTP/ddNTP ratio is
adjusted such that termination of primer
elongation occurs at each base in the template, resulting
in a population of radiolabelled extended primer 
chains, which have a fixed 5’-end determined by the 
annealed primer and a variable 3’-end terminating 
at a specific base. 

4 The radiolabelled reaction products are denatured 
by heating and then separated on a sequencing gel 
in adjacent lanes. The DNA sequence can be read 
directly from the autoradiograph of the gel. 
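
The four steps above can be sketched as a small simulation. This is a minimal illustration, not a laboratory protocol: the template sequence, the per-site termination probability `p_term` and the number of molecules per lane are all hypothetical values, and the primer itself is left out of the fragment lengths.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extend_primer(template_3to5, dd_base, p_term, rng):
    """Extend one primer along the template; return the terminated product."""
    product = []
    for base in template_3to5:
        new_base = COMPLEMENT[base]
        product.append(new_base)
        # Termination is only possible at the base matching this lane's ddNTP.
        if new_base == dd_base and rng.random() < p_term:
            return "".join(product)
    return None  # reached the end of the template without terminating


def run_lanes(template_3to5, p_term=0.2, molecules=2000, seed=0):
    """Return {dd_base: set of terminated product lengths} for the four lanes."""
    rng = random.Random(seed)
    lanes = {}
    for dd_base in "ACGT":
        lengths = set()
        for _ in range(molecules):
            product = extend_primer(template_3to5, dd_base, p_term, rng)
            if product is not None:
                lengths.add(len(product))
        lanes[dd_base] = lengths
    return lanes


def read_ladder(lanes):
    """Read the bands from shortest to longest, one base per band."""
    band_to_base = {}
    for base, lengths in lanes.items():
        for n in lengths:
            band_to_base[n] = base
    return "".join(band_to_base[n] for n in sorted(band_to_base))


template = "TACGGATCCGATTGCA"   # hypothetical template, written 3'->5'
lanes = run_lanes(template)
sequence = read_ladder(lanes)   # newly synthesized strand, 5'->3'
print(sequence)
```

Each lane yields bands only at positions where its own dideoxynucleotide was incorporated, so reading the four lanes together from the shortest band upwards gives the complement of the template.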

Most protocols for enzymatic DNA sequencing 
utilize [α-35S]dATP to label the nascent chain.
Alternative protocols include the use of [α-32P]dATP,
[α-32P]dCTP or [α-33P]dATP and the use of primer
radiolabelled at the 5’-end for the sequencing 
reaction. 

Most enzymatically based sequencing studies 
have used bacteriophage M13 DNA as cloning 
vector; this replicates as a double-stranded DNA 
molecule, but is packaged as single-stranded DNA 
in the virus [43]. This method, however, usually 
requires the subcloning of DNA fragments from 
plasmids into the double-stranded M13 replicating
form, in which the stability of some DNA inserts 
has, at times, been problematic [7]. To eliminate 
subcloning steps, Wallace et al. [44] introduced a 
procedure using linearized plasmid DNA as tem- 
plate for sequencing. In this method, sequencing 
primers were synthesized to align chain elongation 
next to the vector cloning site and were hybridized 
to heat-denatured DNA. Chen and Seeburg [7] 
demonstrated that alkaline denaturation of super- 
coiled plasmid DNA is more efficient than heat 
denaturation. 

DNA sequencing of alkali-denatured supercoiled 
plasmid DNA can produce sequencing gels of the 
same quality as those obtained with M13 vectors. 
The past few years have witnessed several efforts to 
make the direct sequencing of plasmid DNA feasible 
by the chain termination method. One significant 
advantage of plasmid sequencing over the M13 
system is that production of unidirectional overlap- 
ping deletions and the DNA sequencing can be 
performed on a single vector, avoiding subcloning
steps into M13. A recent vector development,
combining the characteristics both of the M13 phage 
and the plasmids, concerns the so-called phagemids 
[21]. This vector system is widely used for cloning, 
mutagenesis and DNA sequencing. 

More recent innovations in sequencing chemis- 
tries include the development of nonradioactive 
detection methods and the automation of some of 
the sequencing stages [45]. The use of methods that 
replace radiolabelled DNA in sequencing reactions 
with fluorescently labelled DNA is growing rapidly. 
The use of fluorescent, rather than radioactive, 
material has the advantages of greater safety, less 
expensive waste disposal, generation of machine- 
readable data, and greater reagent stability. Detec- 
tion of fluorescently labelled material has the 
disadvantage of being rather less sensitive than 
detection of radiolabelled materials and requires 
expensive equipment. 


22.3.2 Vectors for dideoxy sequencing 


Dideoxy sequencing requires a single-stranded 
template to which the primer can anneal. Single- 
stranded templates can be easily generated using 
specialized vectors derived from M13 [43]. Dideoxy 
sequencing can also be readily carried out using 
double-stranded DNA, which has to be denatured 
by heat [44] or alkali [7] prior to the sequencing 
reaction. Dideoxy sequencing of a double-stranded 
template is the only rapid method available for 
verifying a particular plasmid construction, but can 
also be used for large-scale sequencing projects [33]. 


The products of PCR can also be sequenced by the 
dideoxy method. 


22.3.2.1 Filamentous phages 

The dideoxy sequencing method has been greatly 
facilitated by the development of the filamentous E. 
coli phage M13 as a cloning vector [43]. Analysis of 
the life cycle of phages M13, f1, and fd revealed that 
they have a biological system which enables them 
to separate the strands of a DNA molecule [46]. 
Only one particular strand of the double-stranded 
replicative DNA is packaged into the viral capsid 
and secreted into the culture fluid from the infected 
cells. Phage-infected cells grow more slowly than 
uninfected ones, but do not lyse. Thus, cells infected 
with these phages can be grown as normal colonies 
and as plaques, which are defined as regions of 
slowed growth on a continuous lawn of uninfected 
bacteria. 

The M13 vectors most widely used for dideoxy 
sequencing are a series called M13mp constructed 
by J. Messing et al. [43,47-49]. The M13mp series 
contains the lacZ promoter and a partial lacZ gene, 
encoding the α-fragment of β-galactosidase. After
infection of an E. coli F′ host containing another partial
lacZ gene encoding the ω-fragment of β-galactosidase
and induction by isopropyl-β-D-thiogalactoside
(IPTG), M13mp phage produce blue plaques on
Xgal-containing agar plates. In addition, each vector 
in the M13mp series contains a synthetic polylinker 
inserted into the fifth codon of lacZ without 
changing the lacZ reading frame, enabling them to 
produce blue plaques on Xgal agar. Insertion of a 
DNA fragment into one of the unique polylinker 
cloning sites has a high probability of disrupting the 
lacZ reading frame, thus generating a recombinant 
phage producing colourless plaques. Because the 
polylinker is inserted into the same site in lacZ in all 
M13mp derivatives, a synthetic oligonucleotide
primer, which is complementary to a region of lacZ 
adjacent to the 3’ side of the polylinker, is used as a 
‘universal’ primer for all sequencing reactions. 

Vectors based on filamentous phages have a major 
advantage, because they produce a high yield of 
recombinant single-stranded DNA from a 1 ml
volume of bacterial culture. Isolation of hundreds of 
DNA preparations from M13 clones can be done 
by hand or using automated equipment. However, 
the size of DNA fragments to be inserted into the 
vector should be limited to about 2000 bp, because 
phages harbouring large foreign DNA fragments 
grow poorly and have a tendency to undergo 
deletions [13]. Routinely, one can sequence 350-500 
nucleotides in M13 vectors in a single set of 
reactions. 


22.3.2.2 Phagemids

Plasmids containing the origin of replication from 
filamentous phages can be packaged into the phage 
capsid when the bacterial culture is superinfected 
with a phage. A mixture of single-stranded helper 
phage DNA and single-stranded plasmid DNA can 
thus be isolated from the culture [50]. The principle 
of the phagemid system is that it combines the 
properties of plasmid and M13 vectors in a single 
vector. However, when the helper phage and the 
plasmid replicate simultaneously, the latter appears 
to the phage as a defective interfering particle [50], 
which reduces the DNA yields of both the helper 
phage and the single-stranded form of the plasmid 
in stationary cultures. The unpredictability of the 
ratio in yield between recombinant phagemid DNA 
and the helper phage DNA in the DNA prepared 
from phage particles is a problem with using 
recombinant phagemids. More recently, however, the
defective helper phages M13K07 and M13R408
have been developed [51,52]. These compete very
little with empty vectors: more than
95% of the yield is single-stranded vector DNA.
This ratio can drop when recombinant DNA is 
packaged. 


22.3.2.3 Plasmid sequencing vectors 

Numerous plasmid vectors, most available commer- 
cially, can be used for double-stranded dideoxy 
sequencing [33]. Most of these plasmids contain an 
origin of replication derived from plasmids pMB1 or 
ColE1 [53,54], modified to weaken copy number 
control. These plasmids allow a blue-vs.-white
screen for inserts, and for most of them, primers are 
commercially available for sequencing both strands 
of insert DNA. The design of the polylinker is the 
most important feature of the commercially 
available vectors. The polylinker of pUC18/19 [49] 
is found in many of these vectors. Several other 
vectors, including Bluescript, have polylinkers 
containing several useful sites not present in the 
pUC18/19 polylinker, as well as a different configur- 
ation of these sites. To determine which plasmid 
vector is best suited for a particular sequencing 
project, one should check the catalogues of the many 
commercial suppliers (see Appendix III). 

One of the disadvantages of plasmid sequencing 
is that template renaturation during the polymerase 
reaction occurs relatively rapidly. But this drawback 
can be greatly compensated for by using fast-acting 
polymerases such as the phage T7 polymerase 
(Sequenase) or the appropriate DNA polymerases 
derived from thermophilic organisms, such as Taq 
polymerase. Using these enzymes, much better results
can be obtained in plasmid sequencing than
when using the Klenow fragment or reverse transcriptase.


22.3.3 Basic techniques of enzymatic 
DNA sequencing 


22.3.3.1 Difference between labelling/termination and 
the Sanger procedure 

Labelling/termination sequencing reactions are 
carried out in two steps. In the first step, the primer 
is extended in the presence of low concentrations of 
dNTPs, including [α-35S]dATP, until one or more of
the dNTP pools is depleted (labelling step). The 
limiting levels of dNTPs and a low reaction temper- 
ature reduce the processivity of Sequenase, increas- 
ing the number of chains that are extended and 
labelled. At the end of the labelling reaction, the 
uniformly labelled fragments range in size from a 
few to several hundred nucleotides. In the second 
step, synthesis resumes in the presence of additional 
dNTPs and ddNTPs (termination step). In the 
termination reaction, the high dNTP concentration 
and an increased reaction temperature render Se- 
quenase processive, ensuring that the polymerase 
extends each chain without dissociation until the 
incorporation of a dideoxynucleotide [33]. The 
labelling/termination procedure is mostly used for
Sequenase, but can also be applied to other polymerases.
However, because each polymerase has a
different buffer and Mg2+ concentration optimum,
and each discriminates to a different extent against 
ddNTPs, the concentration of these components 
must be modified in each case. 

Two steps are also employed in the Sanger 
dideoxy sequencing reaction, but the purposes of 
the steps are different. In the first step, a pulse 
reaction extends the primer in the presence of 
[α-35S]dATP, unlabelled dNTPs, and ddNTPs. Thus,
both labelling and termination occur in the pulse.
The second step, a chase, employs a high concen- 
tration of all four dNTPs and ensures that all 
extended primers that have not incorporated a 
ddNMP are extended past the region to be 
sequenced. 


22.3.3.2 Sequencing with Klenow fragment 

The traditional Sanger method is based on two
steps. The first is primed DNA synthesis, which
occurs in the presence of a mixture of [α-35S]dATP,
unlabelled dNTPs and ddNTPs and in which 
termination takes place when a dideoxynucleotide is 
incorporated into the growing chain. Thus, both 
labelling and termination occur in the pulse. Second, 
a chase with high concentrations of dNTPs makes 
sure that oligonucleotides that have not specifically 
terminated by incorporation of a dideoxynucleotide
are elongated past the region to be sequenced [33]. 
The Klenow fragment has a 3’→5’ exonuclease
activity, which is significantly lower
than that described for the native T7 DNA
polymerase [55] and does not interfere with the use
of Klenow fragment for DNA sequencing. The 5’→3’
exonuclease activity present in the native E. coli
DNA polymerase I is absent from Klenow fragment.
The Klenow enzyme is relatively non-processive 
and has an intermediate elongation rate. The low 
processivity contributes to a somewhat higher 
background and lower signal-to-noise ratio than 
that found in reactions performed using Sequenase. 
In some cases, Klenow fragment has difficulties in 
reinitiating DNA synthesis at particular sites, 
resulting in ‘ghost’ or ‘shadow’ bands. Klenow 
fragment discriminates strongly against ddNTPs, 
having a several thousandfold preference for a 
dNTP over the corresponding ddNTP. In addition, it 
exhibits a sequence-dependent variability in discrimination.
Although the discrimination against
ddNTPs is reduced over 100-fold by inclusion of
Mn2+ in the reaction mix [56], a significant degree
of sequence-dependent discrimination persists, pro- 
ducing bands that are still more variable in intensity 
than those resulting from reactions with Sequenase. 
Nucleotide analogues such as 7-deaza-dGTP and 
dITP can be used in sequencing reactions with 
Klenow fragment [33].

In the labelling/termination protocol, extension 
lengths can be modulated by altering the dNTP 
concentrations in the first step. Thus, this method 
can produce longer products, on average, than the 
Sanger protocol. This is an advantage when trying to 
maximize the amount of sequence information 
obtained from each template. It can be a disad- 
vantage, however, when only the first few nucleo- 
tides of sequence information after the primer are 
desired. 

For most sequencing projects, where maximizing 
the amount of sequence information obtained per 
template is desired, the labelling/termination
approach is recommended. For situations where 
limited amounts of sequence information are 
required, the Sanger protocol is appropriate. 


22.3.3.3 Sequencing with Sequenase 

The labelling/termination sequencing protocol
involves two steps [55,57]. In the labelling step, 
primed DNA synthesis is initiated in the presence of 
limiting concentrations of all four dNTPs, including 
[α-35S]dATP, and continues until one of the dNTP
pools is depleted. At this point, the uniformly labelled
DNA chains have a random length distribution
ranging from a few nucleotides to hundreds of
nucleotides. In the second step, synthesis resumes in 
the presence of additional dNTPs and one ddNTP. 
Elongation of the DNA chains in this step is rapid 
and processive until termination occurs at specific 
bases after incorporation of the corresponding 
dideoxynucleotide. The average length of the radio- 
actively labelled oligonucleotide products can be 
modified by altering the concentration of dNTPs in 
the first step. It can also be regulated by altering the 
dNTP/ddNTP ratio in the termination reaction. 
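
The effect of the dNTP/ddNTP ratio on product length can be illustrated with a rough calculation. The sketch below assumes, for simplicity, that the polymerase incorporates a ddNTP and the corresponding dNTP equally well and that the terminating base makes up a quarter of the sequence. The concentrations are hypothetical, and the extension already achieved in the labelling step is ignored.

```python
def termination_probability(ddntp_um, dntp_um):
    """Chance of termination at one site that calls for the lane's base."""
    return ddntp_um / (ddntp_um + dntp_um)


def mean_product_length(ddntp_um, dntp_um, base_fraction=0.25):
    """Approximate mean number of nucleotides added before termination.

    Termination sites are geometrically distributed: on average 1/p candidate
    sites are passed, and candidate sites occur once per 1/base_fraction bases.
    """
    p = termination_probability(ddntp_um, dntp_um)
    return 1.0 / (p * base_fraction)


# Hypothetical concentrations in micromolar.
print(mean_product_length(ddntp_um=8.0, dntp_um=80.0))   # shorter products
print(mean_product_length(ddntp_um=4.0, dntp_um=80.0))   # longer products
```

Halving the ddNTP concentration roughly doubles the mean extension in this model, which is the direction of the effect described above.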

The high level of 3’→5’ exonuclease activity in the
native form of T7 DNA polymerase makes it 
ineffective for DNA sequencing. However, the 
exonuclease activity can be selectively removed, 
without affecting the polymerase activity, by a 
chemical reaction [55], or by genetically based 
modification [56]. The genetically modified enzyme 
has a higher specific activity than the chemically 
modified enzyme and the resulting dideoxynu- 
cleotide-terminated fragments are more stable. T7 
DNA polymerase does not have a 5’→3’ exonuclease
activity. 

Sequenase synthesizes DNA at a rapid elongation 
rate by a highly processive mechanism. Under the 
termination reaction conditions, it discriminates 
against ddNTPs by about fourfold compared to 
dNTPs. Thus, although Sequenase produces bands 
of higher uniformity than those produced by 
Klenow fragment or Taq DNA polymerase, there is 
about a 10-fold variation in the intensity of adjacent 
bands. When Mn2+ is included in the reaction mixture,
Sequenase incorporates dideoxynucleotides
at the same rate as deoxynucleotides and the bands 
are almost completely uniform [57]. 
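
The practical consequence of discrimination is that the ddNTP:dNTP ratio must be scaled to the enzyme. As a hedged illustration, suppose a polymerase prefers a dNTP over the corresponding ddNTP by a factor D; the per-site termination probability at a ddNTP:dNTP ratio r is then r / (r + D). The factors 1 (Sequenase with Mn2+) and 4 (Sequenase) follow the text, while 3000 is only a placeholder for the 'several thousandfold' preference of Klenow fragment. The target probability of 0.02 is arbitrary.

```python
def termination_probability(dd_to_d_ratio, discrimination):
    """Per-site chance that a ddNMP rather than a dNMP is incorporated."""
    return dd_to_d_ratio / (dd_to_d_ratio + discrimination)


def required_ratio(target_probability, discrimination):
    """ddNTP:dNTP ratio that gives the target per-site termination probability."""
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must lie strictly between 0 and 1")
    return discrimination * target_probability / (1.0 - target_probability)


# Illustrative discrimination factors (see the lead-in for their origin).
for label, factor in [("Sequenase + Mn2+", 1), ("Sequenase", 4), ("Klenow", 3000)]:
    ratio = required_ratio(0.02, factor)
    print(f"{label:18s} ddNTP:dNTP = {ratio:8.3f}")
```

In this simple model the required ratio grows in direct proportion to the discrimination factor, which is why strongly discriminating enzymes need much more ddNTP relative to dNTP.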

Sequenase utilizes dITP and 7-deaza-dGTP effi- 
ciently. It has a tendency to stall in sequencing reac- 
tions using dITP, resulting in sequencing ladders 
with bands in all four lanes. These prematurely terminated
products can reportedly be removed by treatment
with terminal deoxynucleotidyl transferase [58].
Labelling reactions using modified T7 DNA 
polymerase should be kept below 25°C, both to 
reduce the processivity of the enzyme and to 
maintain its activity in the termination reaction. 
However, termination reactions can be performed 
from 37°C to 50°C.


22.3.3.4 Sequencing with Taq DNA polymerase 

Taq DNA polymerase can be used in DNA 
sequencing employing both the labelling/
termination procedure and the Sanger protocol [33].
The use of Taq DNA polymerase is indicated for
templates exhibiting secondary structures that may
inhibit elongation by the polymerase. Using Taq polymerase,
the reactions can be performed at temperatures
high enough to destabilize many secondary
structures. 

Taq DNA polymerase lacks a 3’→5’ exonuclease
activity. Native Taq DNA polymerase has 5’→3’
exonuclease activity; however, a number of genetically
engineered or post-translationally modified
versions of Taq DNA polymerase have recently been
made commercially available in which the 5’→3’
exonuclease activity has been removed. Taq DNA 
polymerase synthesizes DNA at an intermediate 
elongation rate with a moderate degree of proces- 
sivity. It discriminates strongly against ddNTPs 
and requires a ddNTP/dNTP ratio comparable to 
Klenow fragment. It exhibits better band uniformity 
than does Klenow fragment, but not as good as that 
exhibited by Sequenase. Sequencing reactions 
carried out with Taq DNA polymerase give a very 
clean background on the sequencing gel. To disrupt
potential secondary structure in the
template to be sequenced, 7-deaza-dGTP or 7-deaza-dATP
can each be used in sequencing reactions with
Taq DNA polymerase. However, dITP is usually not
recommended for use with Taq DNA polymerase
because of an unacceptably high frequency of
inappropriately terminated chains. The sequencing
reaction using Taq DNA polymerase can be performed
at 55-70°C in low salt buffer, conditions
which destabilize many template secondary struc- 
tures [9,33,59]. 

A variety of other thermophilic DNA polymerases 
are being developed for use in sequencing reactions, 
including Bst DNA polymerase [60], Tth DNA 
polymerase [61], and Vent DNA polymerase [62]. 


22.3.3.5 Sequencing with reverse transcriptase 

Avian myeloblastosis virus (AMV) reverse transcriptase
is an RNA-dependent DNA polymerase that
uses single-stranded RNA or DNA as a template to 
synthesize the complementary strand. This poly- 
merase can be used for synthesizing long cDNA 
molecules and for generating high-quality chain 
termination sequence data. AMV reverse transcrip- 
tase provides excellent sequencing band resolution, 
but it is not widely used for DNA sequencing. 

This enzyme lacks both 3’→5’ and 5’→3’ exonuclease
activities. Compared to the other polymerases
used for sequencing, it has an intermediate level of 
processivity and a low rate of elongation, which is 
disadvantageous for DNA sequencing since there is 
not as much incorporation of radioactivity into the 
fragments. In addition, background bands due to 
sites where the polymerase has paused are more 
frequent. AMV reverse transcriptase uses ddNTPs 
efficiently and requires ddNTP/dNTP ratios comparable
to Sequenase. It exhibits slightly better band
uniformity than does Klenow fragment. 7-deaza- 
dGTP can be used in sequencing reactions with 
AMV reverse transcriptase. Sequencing reactions 
can be performed at 37-42 °C [33,63,64]. 


22.3.3.6 Sequencing with 5’-end-labelled primers

Primers labelled on the 5’-end are used primarily for
sequencing very long double-stranded DNA frag- 
ments or for templates that have given ambiguous 
results with nascent chain labelling. Nicks in DNA 
templates can act as priming sites in the labelling 
reaction, resulting in a high background on the 
sequencing gel. This can be eliminated when a 5’- 
end-labelled primer is used; only clear products 
from elongation of the primer will be detected by 
autoradiography. 

Because the primer is labelled prior to the 
sequencing reaction, only one step is required for 
extension of the primer in the presence of ddNTPs. 
Nucleotide mixes vary depending on the DNA 
polymerase used in the reaction. Sequenase, Klenow 
fragment and thermophilic DNA polymerases can 
be used for the one-step protocol as long as the 
optimum reaction conditions for the individual 
enzyme are chosen. Sequencing ladders derived 
from 5’-end-labelled primers generally have a clean 
background and band uniformity which is limited 
by variations caused by polymerase-specific arte- 
facts. Primers may be 5’-end-labelled either with 
[γ-32P]ATP or [γ-35S]ATP using T4 polynucleotide
kinase [33]. 


22.3.3.7 Choosing between chemical and
enzymatic DNA sequencing

There is no doubt that dideoxy DNA sequencing is 
simpler and quicker than chemical sequencing. A 
large number of single- or double-stranded samples 
can be prepared for sequencing simultaneously. The 
method also offers excellent band resolution if 35S-labelled
nucleoside triphosphates are used in labelling
the DNA. The primer-annealing and sequencing
reactions can be completed within an hour.
Therefore, the enzymatic approach has become
the more frequently used method for DNA sequencing. But it
also has disadvantages, which relate to the property 
of the various DNA polymerases sometimes to 
terminate chain elongation prematurely owing to 
secondary structures in the templates to be sequenc- 
ed. Although the use of various DNA polymerases 
with their different properties, and the application 
of various nucleotide analogues may help in over- 
coming some of these problems, there often remain 
DNA regions which are poorly resolved by the chain 
termination method. 

Problems associated with polymerase-mediated
chain elongation may be eliminated by
utilizing the chemical approach. With chemical
sequencing, premature termination due to DNA
sequence or structure does not occur, permitting 
sequencing of DNA stretches which cannot be 
sequenced enzymatically. 

To obtain the sequence of short stretches of DNA 
using the chemical method, there is no need to 
subclone into an appropriate sequencing vector. In 
addition, sequencing of small oligonucleotides is 
only possible by chemical degradation. 


22.3.3.8 Choosing a DNA polymerase
for DNA sequencing

The decision about which DNA polymerase to use 
for a sequencing project should be based on in- 
dividual enzyme characteristics. In general, more 
than one polymerase can give reliable sequence 
information when used in a good protocol with 
clean DNA templates. Since any of the DNA 
polymerases can produce artefactual sequence data 
under certain conditions, an effective strategy will 
be to choose one enzyme to generate the bulk of the 
sequence data and then switch to another enzyme 
and/or protocol to resolve remaining ambiguities. 

For large sequencing projects, Sequenase with the 
labelling/termination protocol is the method of 
choice because of its high degree of band uniformity, 
low background, and efficient use of ddNTPs and 
other nucleotide analogues. Because of Sequenase’s 
lack of thermostability, one should switch to Taq 
DNA polymerase in situations where regions with 
secondary structure have to be resolved. 

Klenow fragment can be used both in the 
labelling/termination and the Sanger protocol. It is
the most widely used DNA polymerase in the 
Sanger protocol and has a long track record of use in 
DNA sequencing, but its popularity is dropping 
significantly. Taq DNA polymerase is the alternative 
of choice when template secondary structure causes
premature termination by Klenow fragment. Template
secondary structure most commonly occurs in
stretches of DNA rich in G+C or A+T or which are 
extensively palindromic. Reverse transcriptase has 
been shown to be effective at sequencing through 
G+C-rich regions while Klenow fragment is effec- 
tive for A + T-rich regions [33]. 
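
Regions likely to cause such problems can be flagged in sequence that is already known. The sketch below is a simple illustration rather than an established tool: it reports windows strongly biased towards G+C or A+T and short inverted repeats that could fold into hairpins. The window size, thresholds, arm length and loop sizes are hypothetical defaults.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def biased_windows(seq, window=20, low=0.3, high=0.7):
    """Return (start, gc) for every window rich in A+T (<= low) or G+C (>= high)."""
    hits = []
    for start in range(len(seq) - window + 1):
        gc = gc_fraction(seq[start:start + window])
        if gc <= low or gc >= high:
            hits.append((start, gc))
    return hits


def inverted_repeats(seq, arm=6, min_loop=3, max_loop=10):
    """Return (start, loop) for each arm followed by its reverse complement."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm - min_loop + 1):
        target = reverse_complement(seq[i:i + arm])
        for loop in range(min_loop, max_loop + 1):
            j = i + arm + loop
            if seq[j:j + arm] == target:
                hits.append((i, loop))
    return hits


example = "AAAAGCGCGGTTTTCCGCGCAAAA"   # hypothetical hairpin-forming sequence
print(inverted_repeats(example))
```

A region flagged by either function would be a candidate for sequencing with a different polymerase, a higher reaction temperature or a nucleotide analogue, as discussed above.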

Thermostable polymerases are required for ther- 
mal cycle sequencing protocols and they are useful 
for sequencing templates which are generated by 
PCR. In this case, the high temperature of the 
sequencing reaction not only destabilizes template 
secondary structure but also provides increased 
priming specificity.
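
The enzyme properties discussed in this section can be collected in a small lookup table. The sketch below is only a condensed restatement of the qualitative descriptions given above; the class, the dictionary keys and the two-question selection rule are hypothetical conveniences, not a published decision scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Polymerase:
    name: str
    processivity: str
    elongation_rate: str
    ddntp_discrimination: str
    thermostable: bool


# Qualitative properties as summarized in the text.
POLYMERASES = {
    "Klenow": Polymerase("Klenow fragment", "low", "intermediate", "strong", False),
    "Sequenase": Polymerase("Sequenase", "high", "rapid", "about fourfold", False),
    "Taq": Polymerase("Taq DNA polymerase", "moderate", "intermediate", "strong", True),
    "AMV RT": Polymerase("AMV reverse transcriptase", "intermediate", "low", "low", False),
}


def suggest_polymerase(cycle_sequencing=False, secondary_structure=False):
    """Hypothetical rule: Taq for cycling or structured templates, else Sequenase."""
    if cycle_sequencing or secondary_structure:
        return POLYMERASES["Taq"]
    return POLYMERASES["Sequenase"]


print(suggest_polymerase().name)
print(suggest_polymerase(secondary_structure=True).name)
```

In practice the choice would also depend on the protocol, the template and the nucleotide analogues in use, so the rule is best read as a starting point before switching enzymes to resolve remaining ambiguities.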



22.3.4 Novel DNA sequencing techniques


22.3.4.1 Cycle sequencing 

For primer extension by polymerase to occur in the 
course of a chain termination sequencing protocol, 
DNA within the region to be sequenced must be in a
single-stranded form. With double-stranded tem- 
plates, this is generally accomplished using either 
alkaline [13] or heat [44] denaturation. During 
polymerization, each molecule of template is copied 
once as the complementary primer-extended strand. 
Since a finite number of template molecules are 
present in a ‘one-cycle’ sequencing reaction, the 
purity and concentration of the input template DNA 
are critical to the generation of satisfactory sequence 
data. Under these conditions, the use of highly 
purified templates is necessary to prevent false 
priming. The amount of template added to the 
reaction dictates signal strength of the data generated.
For very large templates, such as λ or cosmid
DNA, it is difficult to introduce sufficient template 
to achieve a good signal. Templates which exhibit 
secondary structure or which undergo rapid rean- 
nealing also present special sequencing challenges. 

The introduction of thermostable DNA poly- 
merases has made it feasible to repeatedly cycle a 
sequencing reaction through alternating periods of 
heat denaturation, primer annealing, extension and 
dideoxy termination. This cycling process effec- 
tively amplifies small amounts of input double- 
stranded DNA template to generate sufficient tem- 
plate for sequencing. By annealing and extending 
the primer at elevated temperatures, problems 
inherent to traditional sequencing protocols are 
eliminated. Double-stranded templates remain in a 
denatured state for longer periods, thus increasing 
the amount of template available for primer 
annealing. 

Thermal cycle sequencing is a relatively simple 
process whose success depends on the use of a 
thermostable DNA polymerase that is functional at 
temperatures which will denature the template. A 
thermostable polymerase, such as Taq DNA poly- 
merase [9], Bst DNA polymerase [60], Tth DNA 
polymerase [61] or Vent DNA polymerase [62], need 
only be added once to the reaction, thus making 
programmed cycling possible. Primer is annealed 
and then extended by the polymerase at temper- 
atures sufficiently high to minimize the secondary 
structure of the template and strand reannealing. 
The newly synthesized strand is then dissociated 
from the template by heating. When the reaction 
is cooled, more primer anneals and a second round 
of synthesis occurs. Each cycle increases the amount 
of product available for sequencing, with the
theoretical yield roughly equal to the number of
cycles performed. 
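
Because only one primer is present, the product accumulates linearly rather than exponentially. The following sketch contrasts the two cases under idealized assumptions; the function names, the femtomole amounts and the `efficiency` parameter are hypothetical.

```python
def cycle_sequencing_yield(template_fmol, cycles, efficiency=1.0):
    """Linear amplification: each cycle copies every template at most once."""
    return template_fmol * cycles * efficiency


def pcr_yield(template_fmol, cycles, efficiency=1.0):
    """Exponential amplification with two primers, shown for comparison only."""
    return template_fmol * (1.0 + efficiency) ** cycles


print(cycle_sequencing_yield(10, 30))   # 30 cycles give 30 times the input
print(pcr_yield(10, 3))                 # three PCR cycles give 8 times the input
```

With a single primer the newly made strands cannot serve as templates, so the yield grows only in proportion to the number of cycles.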

Since so little material is required, template purity 
is less critical. Even slightly impure templates can be 
diluted such that impurities which may be present 
are at non-interfering levels. Because the template is 
amplified, it is even possible to sequence DNA 
extracted from a single colony or plaque [24], if 
efficient lysis conditions are used. The linear amplifi- 
cation achieved by cycling compensates for reduced 
template amount and can generate detectable 
sequence where standard sequencing methods fail. 

With cycle sequencing, even traditionally 
‘difficult’ templates such as λ or cosmid DNA, PCR
fragments, GC-rich templates or single-stranded 
templates with difficult stretches can yield satis- 
factory sequence data. The high annealing and 
polymerization temperatures that are used are 
advantageous in several respects. First, since rapid 
strand reannealing is inhibited, the template re- 
mains denatured longer and template utilization is 
thus more efficient. Second, template secondary 
structure is minimized so that the polymerase is less 
likely to dissociate from the template. Even highly 
structured regions can be satisfactorily sequenced 
using thermal cycling conditions. And, third, since 
hybridizations are more stringent at higher temper- 
atures, nonspecific primer annealing is reduced, 
resulting in less background. 

Cycle sequencing can be performed with either 
end-labelled primers or with primers that incor- 
porate label through extension reactions. When false 
priming is a problem, the use of end-labelled 
primers offers several advantages. Since only the 
specific sequencing primer is labelled prior to 
annealing, chains extended from primers binding 
nonspecifically to other sites do not contribute to 
background on the sequencing gel. Only those 
sequences derived from end-labelled primer will 
be detected. False signals sometimes seen when 
sequencing impure DNA with the label-incorporat- 
ing protocol will thus be avoided. Moreover, the use 
of end-labelled primers is generally less expensive, 
since labelled primer can be prepared in bulk for use 
with many templates over a period of several days. 
End-labelled primers are especially useful for 
generating sequence data very close to the primer, 
and for sequencing large double-stranded DNA 
templates. Since each molecule contains only a 
single radiolabel, this gives an essentially uniform 
band intensity throughout the sequencing ladder. 
Furthermore, degradation of the sequencing pro- 
ducts by radiolysis simply results in unlabelled 
fragments which are not detected on the autoradio- 
gram. 


With its inherent advantages, cycle sequencing is 
quickly becoming a standard protocol [24,65-71]. 
Unlike traditional sequencing methods, it is less 
affected by the nature and concentration of the 
template. In addition, it eliminates the need for 
tedious denaturing protocols and produces data 
which generally exhibit less background, few if any 
strong stops and readable sequence up to 500 bases 
from the primer. 


22.3.4.2 Primer-directed DNA sequencing 

Partial DNA sequences of multiple DNAs
cloned in the same vector can be obtained using a
‘universal’ primer. The sequence of this primer is 
selected to be complementary to a known region of 
the cloning vector near the multiple cloning site. 
This primer is therefore universal in the sense that it 
can be used to obtain sequence information from 
any unknown insert that has been cloned into the 
particular vector from which the primer was 
derived. 
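
How a universal primer determines where the read begins can be shown with a few lines of code. This is an illustrative sketch: the primer and insert sequences are invented for the example and are not those of any real vector, and only exact matches are considered.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G and T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def primer_site(template, primer):
    """0-based position on the template (5'->3') where the primer anneals, or -1."""
    return template.upper().find(reverse_complement(primer))


def expected_read(template, primer, length):
    """First `length` bases synthesized from the primer's 3'-end (5'->3')."""
    site = primer_site(template, primer)
    if site < 0:
        raise ValueError("primer does not anneal to this template")
    # Extension proceeds towards the 5'-end of the template strand.
    return reverse_complement(template[max(0, site - length):site])


insert = "ATGGCCTTAGGC"            # hypothetical unknown insert
primer = "GGATCCAAGCTTGCA"         # hypothetical vector-specific primer
template = insert + reverse_complement(primer) + "AAAA"

print(primer_site(template, primer))
print(expected_read(template, primer, 12))
```

Because the annealing site lies in the vector, the same primer gives a read into whatever insert sits beside it, which is the sense in which it is universal.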

Specific primers can be used to fill gaps between 
contiguous stretches of extended sequence (contigs) 
obtained using random sequencing methods. In 
general, random sequencing will provide about 90% 
of the complete sequence of a large clone before the 
effort needed to further extend the contigs reaches 
an unacceptably high level. The remaining 10% 
usually consists of several small gaps between the 
contigs. These gap sequences may be obtained by 
choosing a specific primer from the end of a known 
contig sequence and using it to obtain sequence 
across the immediately adjacent gap region. By 
sequentially applying this strategy a limited number 
of times, the entire gap sequence can be obtained. 
Closure is realized when this process produces the 
end sequence from the adjacent contig. 

Extended sequence of a particular cloned DNA 
can be obtained using sequence information gener- 
ated from a universal primer to select a new insert- 
specific primer for a further round of sequence 
analysis. Successive cycles of sequencing and new 
primer generation yield the complete sequence of 
interest. The advantage of the approach lies in the 
large amount of sequence information that may be 
obtained from a single clone. However, the speed 
with which the entire insert sequence is acquired is a 
function of several factors: the rate at which new 
specific primers can be selected, synthesized and 
purified; the rate of new sequence acquisition for 
each primer; the amount of sequence information 
sufficiently accurate to pick a succeeding primer 
obtained per run; and the percentage of selected 
primers giving rise to interpretable sequence. For 
sizable clones, a large number of primers are 
required and the gene-walking process can be quite
slow and expensive; the cost of oligonucleotide
synthesis may even be prohibitive [72].
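
To give a feel for how the number of primers grows with clone size, the following minimal sketch counts the sequencing rounds needed to walk along one strand of an insert. The function name, the read length of 400 bases and the 50-base overlap between successive reads are illustrative assumptions, not values taken from the protocols cited above.

```python
import math

def walking_primers_needed(insert_len, read_len=400, overlap=50):
    """Count primers (the universal primer included) needed to walk one strand.

    read_len and overlap are hypothetical values: each new primer is assumed
    to be placed `overlap` bases before the end of the reliable sequence.
    """
    if insert_len <= 0:
        return 0
    if overlap >= read_len:
        raise ValueError("overlap must be smaller than read_len")
    if insert_len <= read_len:
        return 1                      # the universal primer alone is enough
    advance = read_len - overlap      # net new bases gained per walking step
    return 1 + math.ceil((insert_len - read_len) / advance)

if __name__ == "__main__":
    for size in (2000, 40000):
        print(size, "bp insert:", walking_primers_needed(size), "primers")
```

Under these assumptions a 2-kb insert needs 6 primers, whereas a 40-kb cosmid insert needs 115, which illustrates why primer synthesis dominates the cost for sizable clones.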

22.3.4.3 DNA sequencing by primer walking with short oligomers

In most strategies of DNA sequencing, such as 
shotgun or nested deletion subcloning (see Chapter 
20), the major bottleneck in efficiency is not the stage 
of sequencing, but rather, the subcloning and 
template preparation (‘front end’) and/or the 
integration of sequences from individual shotgun 
runs (‘back end’). In contrast, sequencing by primer 
walking minimizes the front and back end 
problems, using the same template many times in a 
processive manner, with a new primer for each run. 


The main drawback of primer walking is the time
and cost of the primer synthesis step. This step
produces a huge excess of the synthesized primer
(0.2-1.0 µmol) over the amount needed for a typical
sequencing reaction (0.5 pmol) [73].

A proposed solution for the inconvenience and 
expense of having to synthesize a primer for each 
sequencing reaction is to use primers that are short 
enough that a manageable library of primers would 
allow any DNA molecule to be sequenced entirely 
with primers selected from the library. Studier
[74] has suggested eliminating the need for synthesis
of walking primers by building a library of
presynthesized short oligonucleotides (8-mers and/or
9-mers, the shortest primers expected to be
unique for a plasmid-size template), but the size of
such a potential library proved to be problematic.

Szybalski [75] proposed an improvement by
performing a template-directed ligation, a technique 
which is compatible with current sequencing 
procedures. Efficient ligation requires that oligonu- 
cleotides pair at adjacent sites in the template DNA. 
He proposed that two hexamers be ligated on the 
template into a unique 12-mer primer, which means 
a 64-fold reduction in the library size (6-mers vs. 9- 
mers); but to make the procedure suitable for routine 
use, the complete ligation of hexamers on the 
template has to be shown. 
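
The library sizes behind these proposals follow directly from the number of possible sequences of length n, which is 4^n. The short sketch below (the function name is an illustrative assumption) reproduces the figures for hexamers, 8-mers and 9-mers and the 64-fold reduction quoted above.

```python
def library_size(n):
    """Number of distinct oligonucleotides of length n over A, C, G and T."""
    return 4 ** n

if __name__ == "__main__":
    for n in (6, 8, 9):
        print(f"{n}-mers: {library_size(n)}")
    # Reduction achieved by replacing a 9-mer library with a hexamer library
    print("reduction:", library_size(9) // library_size(6))
```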

Kieleczawa et al. [76] discovered that saturating 
the template DNA with single-stranded DNA 
binding protein (SSB) stimulates strings of three or 
more unligated hexamers to prime specifically at 
the position of the string and at the same time
suppresses priming by individual hexamers or by
many pairs of contiguous hexamers. When template
DNA is saturated with SSB, strings of three or four
contiguous hexanucleotides can cooperate through
base-stacking interactions to prime DNA synthesis
specifically from the 3’-end of the string. Under the
same conditions, priming by individual hexamers is
suppressed. Strings of three or four hexamers,
representing more than 200 of the 4096 possible
hexamers, primed easily readable sequence ladders
at more than 75 different sites in single-stranded or
denatured double-stranded templates. A synthesis
of 1 µmol of hexamer supplies enough material for
thousands of primings, so multiple libraries of all 
4096 hexamers could be distributed at a reason- 
able cost, allowing rapid and economical DNA 
sequencing. 

Kotler et al. [73] published a protocol in which
they describe a manyfold increase in the
sequence specificity of priming by short oligonucleotides,
such as hexamers or pentamers, when they are
tandemly annealed to the DNA template, as
compared with annealing each separately. In
DNA sequencing reactions this phenomenon results
in unique priming by what they term a ‘modular
primer’, a tandem string of two or three short
oligonucleotides, with no ligation required. In
contrast, the same pentamers or hexamers show
nonunique, multiple priming when used ‘one at a
time’ without adjacent partners. This effect is
interpreted as resulting in part from the increased
affinity of the oligonucleotides for the template
caused by base stacking when they anneal
next to one another, in comparison with
annealing alone, with no neighbours. Modular
primers showed a 91% success rate in sequencing 
reactions, which is comparable to the performance 
of conventional 17-mer primers. A complete 
oligonucleotide library of all possible pentamer or 
hexamer sequences comprises only 1024 or 4096 
samples, respectively, and would remove the need 
for synthesis of new primers for each walking step. 
Not only time but also cost per walking step is thus 
reduced, since the scale of oligonucleotide synthesis 
is sufficient to produce thousands of libraries for 
users. 
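
As an illustration of how a walking step might draw on such a library, the sketch below splits the last 18 bases of an already determined read into three contiguous hexamers and looks each one up in a complete hexamer library. It assumes that the read is written 5’ to 3’ and that walking continues in the same direction; the function name and the alphabetical numbering of library positions are hypothetical conveniences, not part of the published protocols.

```python
from itertools import product

BASES = "ACGT"
# Complete hexamer library in alphabetical order (hypothetical numbering scheme)
HEXAMER_LIBRARY = ["".join(p) for p in product(BASES, repeat=6)]
LIBRARY_INDEX = {hexamer: i for i, hexamer in enumerate(HEXAMER_LIBRARY)}

def modular_primer(known_read, n_modules=3):
    """Return (hexamer, library position) pairs covering the 3' end of a read.

    The hexamers are contiguous, so they would anneal side by side on the
    template strand complementary to the read.
    """
    span = 6 * n_modules
    read = known_read.upper()
    if len(read) < span:
        raise ValueError("read is shorter than the requested modular primer")
    tail = read[-span:]
    modules = [tail[i:i + 6] for i in range(0, span, 6)]
    for m in modules:
        if m not in LIBRARY_INDEX:
            raise ValueError(f"ambiguous base in module {m!r}")
    return [(m, LIBRARY_INDEX[m]) for m in modules]

if __name__ == "__main__":
    for hexamer, position in modular_primer("GGGACGTACGTTAGCCATGCA"):
        print(hexamer, position)
```

A real selection step would also have to avoid hexamers that give poor priming; the sketch only shows the bookkeeping.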


22.3.4.4 Sequencing by hybridization 

The success of the human genome project will 
depend on whether DNA sequencing approaches 
can greatly increase throughput and decrease cost. 
Strategies that may help to accomplish this task 
include a greatly improved method of sequencing 
based on the conventional automated fluorescent 
DNA sequencers and the advent of new sequencing 
technologies [77]. 

Any linear sequence is an assembly of overlapping, 
shorter sequences. Sequencing by hybridization 
(SBH) [78,79] is based on the use of oligonucleotide 
hybridization to determine the set of constituent 
subsequences (such as 8-mers) present in a DNA 
fragment. Unknown DNA samples can be attached
to a support and sequentially hybridized with 
labelled oligonucleotides (format 1); alternatively, 
the DNA can be labelled and sequentially hybri- 
dized to an array of support-bound oligonucleotides 
(format 2). Highly discriminative hybridization is 
required to distinguish between perfect DNA 
fragment and oligonucleotide complementarity and 
all hybridizations exhibiting one or more nucleotide 
mismatches [78]. Reliable conditions for this 
discrimination have recently been determined for 
format 1 [78]. 
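
The principle that a sequence can be reassembled from its overlapping subsequences can be sketched in a few lines. The example below is a simplified illustration, not the algorithm used in refs 78 and 79: it assumes error-free hybridization data, treats the result as a set of k-mers (presence or absence only), and gives up as soon as an overlap is ambiguous, as happens with repeats. The function names are hypothetical.

```python
def spectrum(seq, k):
    """Set of all k-mers present in seq (what ideal hybridization would report)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reconstruct(kmers):
    """Rebuild a sequence from its k-mer set by following unique (k-1) overlaps."""
    kmers = set(kmers)
    if not kmers:
        raise ValueError("empty spectrum")
    k = len(next(iter(kmers)))
    suffixes = {m[1:] for m in kmers}
    # The first k-mer is the only one whose prefix is not the suffix of another
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting k-mer (e.g. a tandem repeat)")
    seq = starts[0]
    unused = kmers - {seq}
    while unused:
        candidates = [m for m in unused if m[:-1] == seq[-(k - 1):]]
        if len(candidates) != 1:
            raise ValueError("ambiguous or broken overlap")
        seq += candidates[0][-1]
        unused.remove(candidates[0])
    return seq

if __name__ == "__main__":
    target = "ATGGCGTGCAATC"
    print(reconstruct(spectrum(target, 4)) == target)
```

With the tandem repeat `ACGACGACGT` and k = 3 the set of k-mers collapses to four members and no unique path exists, so the sketch raises an error instead of guessing.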

The sets of n-mer oligomers used as hybridization 
probes can vary in number from several hundred to 
all possible combinations (65536 for octamers), 
depending on the type of sequence information 
required. The completeness of the probe set and its 
design, which can vary according to such para- 
meters as length of probe and internal or flanking 
positioning of unspecified bases, determines the 
kind of sequence information that can be extracted 
from individual DNA fragments or libraries of 
fragments. Mapping information that determines 
clone overlap can be obtained with 100-200 probes. 
The positioning and identification of genome 
structural elements (partial sequencing) requires 
500-3000 probes [78], and complete sequencing 
requires data from 3000 or more septamer probes on 
three to five related genomes. 

By combining SBH and traditional gel sequencing 
methods, overall throughput and accuracy can be 
improved by at least an order of magnitude. SBH 
can be used to fill in gaps and check errors. It is 
possible that SBH could itself become a primary 
method of genomic sequencing except for segments 
consisting of multiple tandem repeats. In addition, 
because of the combinatorial and parallel features of 
SBH, several similar genomes could be readily 
sequenced simultaneously [78]. Furthermore, SBH is 
an ideal method for identification of genes, repeats, 
and motifs in chromosomes and cDNAs as well 
as for studying the differences between related 
populations and species. 

It is expected that SBH will contribute to the 
development of rapid and inexpensive techniques 
for mapping clones, facilitate DNA sequence analy- 
sis, and extend the applicability of DNA finger- 
printing for diagnostic use [78]. 


22.4 Direct sequencing of 
PCR products 


22.4.1 Overview 
PCR has gained widespread application, allowing 


sequences of interest to be amplified from a complex 
sample of nucleic acid [9,42,80]. PCR is based on the 
use of two oligonucleotides to prime DNA-poly- 
merase-catalysed synthesis from opposite strands 
across a region flanked by the priming site of the 
two oligonucleotides. By repeated cycles of DNA 
denaturation, annealing of oligonucleotide primers, 
and primer extension, an exponential increase in 
copy number of a discrete DNA fragment can be 
achieved. Although the cloning of amplified DNA is 
relatively straightforward, direct sequencing of PCR 
products facilitates and speeds the acquisition of 
sequence information. As long as the PCR reaction 
produces a discrete amplified product, it will be 
amenable to direct sequencing. 

In contrast to methods where the PCR product is 
cloned and a single clone sequenced, the approach 
in which the sequence of PCR products is analysed 
directly is generally unaffected by the compara- 
tively high error rate of Taq DNA polymerase. Errors 
are likely to be stochastically distributed throughout
the molecule; thus, the overwhelming majority of
the amplified product will consist of the correct 
sequence. The only exceptions to this rule may be 
those cases where only a very small number of 
template molecules are present at the outset of the 
reaction. In such cases, an error occurring in the 
first cycle of amplification may be exponentially 
amplified [81]. 
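
The dependence on the number of starting molecules can be made concrete with an idealized calculation. The sketch below assumes perfect doubling in every cycle and a single misincorporation in one strand synthesized during the first cycle; both are simplifying assumptions, and the function name is illustrative.

```python
def first_cycle_error_fraction(n_templates):
    """Fraction of final strands that carry one error made during cycle 1.

    n_templates double-stranded molecules contribute 2N parental strands;
    cycle 1 adds 2N new strands, one of which carries the error. With perfect
    doubling this fraction stays constant in all later cycles.
    """
    if n_templates < 1:
        raise ValueError("at least one template molecule is required")
    strands_after_cycle_1 = 4 * n_templates
    return 1 / strands_after_cycle_1

if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, "template(s):", first_cycle_error_fraction(n))
```

Starting from a single template molecule, a quarter of the product carries the error and it would be visible in the sequencing ladder; starting from 1000 molecules the same event affects only 0.025% of the product.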

Templates for sequencing can advantageously be 
produced by PCR from any DNA-containing source. 
PCR is even useful for amplifying sequence from
already cloned material, since it avoids having to 
grow clones and isolate DNA. PCR products can be 
sequenced by either the dideoxy (Sanger) [1] or the 
chemical (Maxam-Gilbert) [2] approach. 


22.4.2 Generation of sequencing template 


By using PCR, templates for sequencing can be 
generated more efficiently than with cell-dependent 
methods either from genomic targets or from DNA 
inserts cloned into vectors. Amplification of cloned 
inserts can be achieved using oligonucleotides that 
are priming inside, or close to, the polylinker of the 
cloning vector [82]. 

Sequencing the PCR products directly has two 
advantages over sequencing cloned PCR products. 
First, it is readily standardized because it is a simple 
enzymatic process that does not depend on the use 
of living cells. Second, only a single sequence needs 
to be determined for each sample. 


22.4.2.1 Generation of single-stranded DNA templates 
Most of the difficulties arising in sequencing double-stranded
DNA are derived from strand reassociation
during the sequencing reaction. These problems can
be avoided by preparing single-stranded DNA templates
by any of the following methods.


Asymmetric PCR In this approach, an excess of one 
amplified strand (relative to its complement) is 
generated by the addition of one primer in vast 
excess over the other. The resulting excess of single- 
stranded product is then used as a template for the 
production of the dideoxy-terminated chains from 
which the sequence is derived. 

During the first 20-25 cycles, double-stranded 
DNA is generated, but when the limiting primer is 
exhausted, single-stranded DNA is produced for the 
next 5-10 cycles by primer extension. Using an 
initial ratio of 50 pmol of one primer to 0.5 pmol of
the other primer, double-stranded
DNA accumulates exponentially up to the point at
which the limiting primer is almost exhausted, and
then stops. Single-stranded DNA generation
starts at about cycle 25, the point at which the
limiting primer is almost depleted. In general, a ratio
of 50 pmol to 1-0.5 pmol of primers in a 100-µl PCR reaction will result
in about 1-3 pmol of single-stranded DNA after 30
cycles of PCR [82].
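
The two phases of asymmetric PCR can be illustrated with a toy model. The sketch below assumes that each new duplex consumes one molecule of limiting primer, that double-stranded product grows by a fixed efficiency per cycle until that primer is used up, and that each later cycle copies a fixed fraction of the duplexes into single strands. The starting amount and both efficiencies are hypothetical values chosen only for illustration.

```python
def asymmetric_pcr(limiting_pmol=0.5, start_pmol=1e-6, cycles=30,
                   exp_eff=0.9, lin_eff=0.5):
    """Toy model returning (cycle, double-stranded pmol, single-stranded pmol)."""
    ds = start_pmol            # double-stranded product
    ss = 0.0                   # single-stranded product
    primer_left = limiting_pmol
    history = []
    for cycle in range(1, cycles + 1):
        if primer_left > 0:
            # Exponential phase: limited by the remaining limiting primer
            new_ds = min(ds * exp_eff, primer_left)
            ds += new_ds
            primer_left -= new_ds
        else:
            # Linear phase: only the excess primer is extended
            ss += ds * lin_eff
        history.append((cycle, ds, ss))
    return history

if __name__ == "__main__":
    for cycle, ds, ss in asymmetric_pcr():
        print(f"cycle {cycle:2d}  ds = {ds:.4f} pmol  ss = {ss:.4f} pmol")
```

The model also shows why the amount of limiting primer matters: it caps the amount of double-stranded template and therefore the single-stranded yield per linear cycle.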

Low yields of single-stranded DNA from asymmetric
PCR may reflect either too little of the limiting
primer, which prevents the accumulation of enough
double-stranded DNA to serve as a template for the
primer-extension reaction, or too much of the
limiting primer, which saturates the reaction with
double-stranded DNA before any single-stranded DNA is
produced [82].

The single-stranded DNA generated can then be 
sequenced using either the PCR primer that is 
limiting, or any complementary sequence internal to 
the 3’-end of the single-stranded template. 

The use of asymmetric primer ratios does not 
always result in reproducible high yields of single- 
stranded product. An alternate protocol entails 
isolation of double-stranded PCR products followed 
by reamplification, which is performed in the 
presence of a single primer. The asymmetric PCR 
has the advantage that, because the limiting primer 
is exhausted, there is no need to remove excess 
primers prior to initiating the sequencing reaction. 
Protocols for generation of templates by asymmetric 
PCR are described in refs 81-84 (see also Chapter 21, 
Protocols 109 and 110). 


Lambda exonuclease-generated single-stranded DNA 
An alternative approach for generating single-stranded
products, which does not require the use of
unequal primer concentrations, is described by
Higuchi and Ochman [85]. In this procedure, one of
the oligonucleotide primers is treated with polynucleotide
kinase to introduce a 5’-phosphate prior
to the PCR. After a symmetric PCR, the products are
exposed to the double-strand-specific 5’→3’ exonuclease
(lambda exonuclease), which is only active if a phosphate is present
at the 5’-position. Only the strand flanked by the 
phosphorylated primer will be degraded. The 
single-stranded DNA is then purified from the 
reaction mix and used for sequencing. 


Solid-phase DNA sequencing Solid-phase methods 
of producing sequencing templates from PCR 
products for conventional, including fluorescent, 
methods of nucleic acid sequencing are gaining 
widespread use (see Chapter 21). They produce 
quality templates at a high rate, particularly when 
automated, and avoid centrifugation, phenol extrac- 
tion, ethanol precipitation, or column chromato- 
graphy [29,30,86,87]. 

Strongly binding a single strand from a PCR
reaction to a solid phase allows the remainder of the 
reaction components to be removed by thorough 
washing. Streptavidin-coated magnetic beads have 
been successfully used as the solid phase to pioneer 
the approach [30] (see Chapter 21, Protocol 113). 
Streptavidin has an extremely high affinity
($K_a = 10^{15}$ M$^{-1}$) and specificity for biotin [88]. The
strand to be immobilized on streptavidin-coated 
magnetic beads must therefore contain biotin, which 
is conventionally achieved by biotinylating the 
specific primer. 

The beads on which a strand has been immo- 
bilized can be washed repeatedly. They are collected 
by a magnet, other reaction components aspirated, 
and then the beads resuspended in fresh liquid, 
following removal of the magnet. 

Each strand of a PCR can be prepared separately 
for sequencing because in non-denaturing condi- 
tions the non-biotinylated strand remains hydrogen 
bonded to the immobilized strand but can be 
removed by strong alkaline denaturing conditions 
without affecting the biotin-streptavidin binding. 

Protocols for generation of single-stranded tem- 
plates by solid-phase technique are described in refs 
82, 83 and 89. 


22.4.2.2 Generation of double-stranded DNA templates 

Although double-stranded, closed circular DNA 
templates can be alkaline-denatured and sequenced 
using dideoxy chain termination protocols, this 
method of template preparation gives poor results 
with linear PCR products. Many problems asso- 
ciated with direct sequencing of PCR products result 
from the ability of the two strands of the linear 
amplified product to reassociate rapidly after denaturation.
This leads either to blocking the extension
of the primer-template complex or to preventing the 
sequencing oligonucleotide from annealing effi- 
ciently [82,83]. This problem is more severe for 
longer PCR products. To circumvent the strand 
reassociation of double-stranded DNA, a number of 
alternative methods have been developed. 


Precipitation of denatured DNA The template DNA is 
denatured in 0.2 M NaOH, neutralized by adding 0.4
vol. of 5 M ammonium acetate (pH 7.5), and the DNA is
immediately precipitated with 4 vols ethanol. The
DNA is resuspended in sequencing buffer and 
primer at the desired annealing temperature [82]. 


Snap-cooling of template DNA The template DNA is
denatured by heating (95°C) for 5 min and then quickly
frozen by placing the tube in a dry-ice-ethanol
bath to slow down the reassociation of strands. The
sequencing primer is added either prior to or after 
denaturation and brought to the appropriate anneal- 
ing temperature [82,90]. 


Cycle sequencing of PCR products Another choice for generating
enough sequencing template is to cycle the
sequencing reaction, using appropriate thermophilic
DNA polymerases, such as Taq DNA polymerase I,
as enzymes for both amplification and
sequencing. Even though only a small fraction of the 
templates will be utilized in each round of extension 
and termination, the amount of specific termina- 
tions will accumulate with the number of cycles 
[65,66,82,90]. A more detailed description of cycle 
sequencing is given in Section 22.3.4.1. 


22.4.3 Sequence analysis of 
PCR products in practice 


Both the dideoxy and the chemical sequencing 
approaches can be used to analyse PCR products 
directly. The general differences, advantages and 
shortcomings of the two sequencing approaches 
are discussed in Sections 22.3.3.7 and 22.3.3.8. In 
general, the dideoxy method involves fewer steps 
than chemical sequencing. Where thermophilic 
DNA polymerases are used for sequencing reac- 
tions, the same buffer will serve for the amplification 
and sequencing steps, and products do not have to 
be cleaned and isolated repeatedly. 


22.4.3.1 Enzymatic approach 

The technology and procedures used in dideoxy 
sequencing are by now well standardized; both T7 
DNA polymerase and Klenow fragment as well as
Taq DNA polymerase or other appropriate thermophilic
DNA polymerases can be employed. A
number of commercially available dideoxy ‘kits’ 
yield good results. Provided that single-stranded 
templates of good quality are used, the dideoxy 
methods permit rapid sequencing of amplified 
products [81]. 


22.4.3.2 Chemical approach (genomic sequencing) 

The genomic sequencing approach of sequencing 
PCR products combines the PCR amplification 
method [9,36] with the genomic sequencing techni- 
que [2,91]. Following PCR, the amplified DNA is 
chemically sequenced, transferred by electro- 
blotting, and covalently bound by UV crosslinking 
onto a nylon filter that can be repeatedly probed 
with short, sequence-specific oligonucleotides. This 
method has proved particularly useful for sequenc- 
ing large regions. In such cases, the sequences of 
interest can be amplified simultaneously in a single 
PCR reaction, or separately as a set of discrete 
adjacent or overlapping fragments. Several ampli- 
fied fragments can be mixed, simultaneously 
sequenced, run on gels, and the sequence of the 
different fragments successively visualized by the 
use of appropriate end-labelled probes. In addition, 
sequences from both strands can be derived from a 
single filter, with a consequent increase in sequence 
accuracy. The direct chemical sequencing of a 
labelled strand is a fast alternative if only a single 
product and a single strand is being sequenced. The 
utility of this direct approach is thus limited to those 
cases when the PCR yields a single DNA species or 
when the product of interest can be readily purified 
[82]. 

Detailed protocols for direct sequencing of PCR 
products based on the enzymatically mediated 
chain termination method are described in several 
publications [59,81,83,84]. Chemical sequencing of 
PCR products is described in refs 81, 92 and 93. 


22.5 Automation in DNA sequencing 


One of the major advances in sequencing technology 
in recent years is the development of automated 
DNA sequencers, which automate the gel electro- 
phoresis step, detection of DNA band pattern, and 
analysis of bands. These machines are based on the 
chain termination method and utilize fluorescent 
rather than radioactive labels. The fluorescent dyes 
can be attached to the sequencing primer, to the
dNTPs or to the ddNTPs, and are incorporated into the
DNA chain during the strand synthesis reaction 
mediated by a DNA polymerase, such as Sequenase, 
Klenow fragment or Taq DNA polymerase. The four 
sets of oligonucleotides generated by the sequencing
reactions are loaded onto the gel manually and 
electrophoresis is then controlled automatically. 

Detection occurs at a point near the bottom edge 
of the gel in one of two ways. In one method, 
applicable for either fluorescently or radioactively 
labelled DNA, the bands of DNA moving sequen- 
tially past a detector are recorded. In the second 
method, the banding pattern of fluorescently
labelled DNA is detected using an imaging camera. 
All automated sequencers include data-collection
capabilities and either provide further analysis
programs or allow the data to be exported to external
data-analysis software. A number of such
sequencing machines are now commercially 
available and are becoming increasingly popular 
[77,94-96]. 

Automation by robotics of template preparation 
and purification and of the sequencing reaction is 
under development [33,98-103]. 

Research is under way to develop the technology 
of mass spectrometry for DNA sequencing [104], a 
technique that could replace the gel electrophoresis 
step in DNA sequencing. Alternatively, resonance 
ionization spectroscopy combined with mass 
spectrometry could enable much faster analysis of 
isotopically labelled DNA bands [32,104]. 


22.6 Future techniques in 
DNA sequencing 


The demand for improved DNA sequencing metho- 
dologies posed by the Human Genome Project has 
spurred the development of both conventional and 
unconventional approaches [79,105]. 


22.6.1 Gel electrophoresis 


The slow step in the operation of automated DNA 
sequencers is electrophoresis. Typically, the separation
requires at least 12-14 h, during which the
expensive instrument is fully occupied. If the speed 
of the separation could be substantially increased, 
the throughput of the instrument could be increased 
correspondingly, decreasing the cost of the sequenc- 
ing process. 

Two recent developments have the potential to
significantly increase the throughput of electrophoresis-based
sequencing instruments: capillary arrays and ultrathin
slab gels. Electrophoresis in ultrathin (50 µm) gels
increases heat transfer efficiency, which permits higher
electric fields to be applied without deleterious
thermal effects, resulting in correspondingly more
rapid separations. Employing this technique, separation
of fragments up to 600 bases in length can
be achieved in 2 h. This separation rate is a factor
of four to five greater than can be achieved in 
conventional slab-gel electrophoresis. Work is in 
progress on such systems, both with arrays of 
capillaries [106] and ultrathin slabs [107]. 

Several efforts have been undertaken to make 
capillary electrophoresis usable for DNA sequenc- 
ing [108]. Currently, the principal advantage of this 
method is its speed, which can be up to 25-fold faster 
than conventional analysis [108]. To date, most 
capillary electrophoresis systems have used a single 
capillary. To compete with slab-gel electrophoresis, 
many capillaries must be used simultaneously. The 
presently available systems allow separation of 
fragments up to 320 bases in length. Sequence length 
must be increased to be useful in large-scale 
sequencing efforts. 


22.6.2 Scanning microscopy 


In this approach, atomic force or scanning tun- 
nelling microscopy (STM) would be used to 
generate a high-resolution image of individual DNA 
molecules. The goal is to identify individual bases in 
single-stranded DNA; the sequence is determined 
by imaging along the length of the strand. Early 
work with atomic-scale imaging led to high hopes 
for this sequencing method. The best achievable 
resolution of DNA fragments is between 2 and 5nm. 
Although it is possible to observe hints of helical 
structure in double-stranded DNA with the current 
generation of microscopes, more than one order of 
magnitude improvement in resolution will be 
required to produce images with sufficient resolu- 
tion to determine DNA sequence accurately [109, 
110]. 


22.6.3 Mass spectrometry 


Mass spectrometric approaches range from simple 
replacement of the fluorescence detector in gel 
electrophoresis with a mass spectrometric detector 
to ambitious approaches for sequence determinat- 
ion on a single large DNA molecule in an ion trap. 
However, before mass spectrometry can be used for 
sequencing, three issues need to be addressed. First, 
the sequencing sample must be introduced to the 
gas phase without damage. Second, the sequencing 
fragments must be separated on the basis of mass. 
Last, the massive DNA fragments must be detected 
efficiently. 

An intermediate approach being pursued in 
several laboratories is the replacement of the gel 
electrophoresis separation of Sanger fragment sets 
with mass spectrometric separation and detection. 
This possibility arises because of the relatively new
technique of matrix-assisted laser desorption/ionization
(MALDI), which permits singly charged ions from
proteins as large as 300000 Da to be produced and 
mass analysed [105]. With the best matrix for mixed 
sequence oligomers found to date, 3-hydroxypi- 
colinic acid, the largest oligomer determined was 
67 bases, with about 10 pmol of each component 
analysed [105]. 

A factor of 10 in mass range and more than a factor 
of 10 in resolution will be required to achieve useful 
sequencing accuracy. If the progress in this field over 
the last three to four years is continued in the future, 
separations of Sanger mixtures may be performed in 
seconds, which compares well with the hour or so 
required even in the ultrafast gel electrophoresis 
system. 


22.6.4 Sequencing by hybridization 


This up-and-coming technology is described in 
more detail in Section 22.3.4.4. Sequence infor- 
mation is obtained by hybridization of small probes 
to a target to be sequenced. Two formats have been 
proposed for the method. In one, many targets are 
arranged on a support and hybridized successively 
with every possible oligonucleotide probe. In the 
case of 8-mer probes, this could require as many 
as 65000 successive hybridizations, a somewhat 
intimidating prospect. In the other format, this 
scenario is inverted by arraying the 65000 probe 
oligomers on a support and hybridizing the target 
sequence to the array; the hybridization pattern then 
determines the sequence. Several technical issues 
arise in practice; a particularly thorny one is the
effect of repetitive DNA upon sequence recon- 
struction. Because of this, the method is unlikely to 
serve as a primary sequencing tool for complex 
genomes. But these arrays could be very powerful 
for the sequence analysis of short non-repetitive 
DNA fragments and used in conjunction with 
primary sequence data derived by other methods to 
provide a rapid means of confirming and correcting 
sequence data [77-79,105].

23.1 Introduction


Before the products of sequencing reactions (see 
Chapter 22) can be analysed, they have to be 
separated. For this purpose, slab gel electrophoresis 
is generally used, although in recent years, capillary 
electrophoresis has been developed as a possible 
alternative. 

For large sequencing projects, this separation step 
is currently the bottleneck in the whole sequencing
procedure. In order to read a sequence, single base 
resolution is needed. The sequencing reactions 
produce DNA fragments of up to a thousand bases 
or more, but it is not yet possible to separate 
molecules of that size at high resolution. Some effort 
has been made to improve the readability of the gels, 
but the average reading in real sequencing projects is 
still only a few hundred bases [1-3]. 

It has to be pointed out that the limits of electro- 
phoretic separation are intrinsic to this method. The 
resolution (R) that can be achieved depends on 
different factors: generally, it is defined as the band
separation $\Delta X$ (distance between the centres of two
bands) divided by the average of the two band
widths $W_1$ and $W_2$ (Fig. 23.1):

$$R = \frac{2\,\Delta X}{W_1 + W_2} \qquad (23.1)$$

Fig. 23.1 Schematic concentration profile of two bands of
gaussian shape, showing the parameters influencing the
resolution. $\Delta X$, interband distance; $W_i$, band width at the
inflection point ($W_i = 2\sigma$); $W_h$, band width at half-height
($W_h = 2\sigma\sqrt{2\ln 2}$); $W_b$, band width at the baseline,
estimated by drawing the tangents to the peak at the
inflection point and extrapolating to the baseline
($W_b = 4\sigma$). The resolution between the two bands is
defined as $R = 2\Delta X/(W_{b1} + W_{b2}) = \sqrt{2\ln 2}\,\Delta X/(W_{h1} + W_{h2})$,
using the band width at the baseline or at half-height,
respectively.
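
The equivalence of the two expressions for the resolution given in the legend to Fig. 23.1 can be checked with a short worked derivation. It assumes, as the legend does, ideal gaussian bands with standard deviation $\sigma$; the numerical factor 1.18 is simply $\sqrt{2\ln 2}$ rounded.

```latex
% Gaussian profile: c(x) \propto \exp(-x^2 / 2\sigma^2)
% Half-height:  \exp(-x^2/2\sigma^2) = 1/2  \Rightarrow  x = \pm\sigma\sqrt{2\ln 2}
\begin{align*}
W_h &= 2\sigma\sqrt{2\ln 2} \approx 2.355\,\sigma, \qquad W_b = 4\sigma
      \;\Rightarrow\; W_b = \frac{2\,W_h}{\sqrt{2\ln 2}} \\[4pt]
R   &= \frac{2\,\Delta X}{W_{b1} + W_{b2}}
     = \frac{2\,\Delta X\,\sqrt{2\ln 2}}{2\,(W_{h1} + W_{h2})}
     = \frac{\sqrt{2\ln 2}\;\Delta X}{W_{h1} + W_{h2}}
     \approx \frac{1.18\,\Delta X}{W_{h1} + W_{h2}}
\end{align*}
```

For two bands of equal width the first form reduces to $R = \Delta X/4\sigma$, so $R = 1$ corresponds to band centres that are $4\sigma$ apart.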


Therefore, we can distinguish between two main 
factors, that is, ‘band-spacing effects’ and ‘band- 
broadening effects’, which influence the resolution. 

The interband spacing is given by the length of the 
migration path and the velocity difference between 
the two species, which in turn is determined by the 
migration mechanism. In sequencing gels, different 
separation mechanisms can be identified [4,5]. Small 
DNA fragments are separated by a sieving mechanism
[6,7] and their mobility is proportional to
exp(-molecular weight). With increasing molecular
weight, they become larger than the pore size and 
they begin to ‘reptate’ [8-10]. As a consequence, 
their mobility becomes inversely proportional to 
their size. Both mechanisms give a good band 
separation. With a further increase in molecular 
weight, however, the mobility reaches a plateau and 
separation fails. From theoretical considerations 
[8-10] and experimental studies of double-stranded 
DNA in agarose [11,12], this effect is known to be 
due to orientation of the molecules. This upper 
separation limit decreases with increasing electric 
field, which means that higher electric fields are 
counterproductive. However, recent data indicate 
that at very high fields a different migration process 
might occur, which seems to preserve good band 
separation up to a thousand bases [13]. 

Experimental studies [4,14] and numerical esti- 
mates [15] show that in sequencing gels under the 
usual conditions (40-50 V cm$^{-1}$, 6% polyacrylamide),
molecular orientation only takes place at a frag- 
ment size of about 1.5-2 kb, which is well above the 
current practical limit of gel readings. Therefore, 
molecular orientation cannot be the main reason for 
the limited readability. 

The second limiting factor is the band width, that 
is, the resolution limit is reached when bands 
become so broad that they overlap. The band width 
can be influenced by the migration mechanism (e.g.
ref. 16), but is mainly due to dispersion effects 
independent of the migration mechanism. Assum- 
ing that the bands have a gaussian shape, their 
width is determined by the variance of the
concentration distribution. The band width can be taken
as the width at half peak height ($W_h$) or as the
width at the base, which is four times the square root
of the variance ($W_b = 4\sigma$; see Fig. 23.1). The total
variance is the sum of the individual contributions 
of different (presumably independent) factors, for 
example diffusion (dif), Joule heating and tempera- 
ture profile (AT), adsorption (ads), initial band width 
(ibw) and other possible sources [17]: 


$$\sigma^2_{tot} = \sigma^2_{dif} + \sigma^2_{\Delta T} + \sigma^2_{ads} + \sigma^2_{ibw} + \sigma^2_{oth} \qquad (23.2)$$

Among these, probably the most important
contributors are diffusion and Joule heating. Band 
broadening due to diffusion increases as the square 
root of time (e.g. ref. 18), which means that reducing 
the run time is advantageous. Assuming a parabolic 
temperature profile, Joule heating causes the band 
width to increase with the square of the gel thickness 
and the square of the electric field [18,19] (other 
authors estimate a third-power dependence [5,20]), 
which is a strong motivation to use thin gels. 
Therefore, regarding the electric field, a compromise 
has to be found between the reduction of run time 
(i.e. to reduce the diffusion) on one hand and the 
increase in band width due to thermal effects on the 
other. In fact, there is an optimal field strength, 
where the band spreading is minimized [5,18,19]. 
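The existence of an optimal field strength can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the treatment of refs 5, 18 or 19: it takes the diffusion variance to be proportional to the run time and hence to 1/E (fixed migration distance, field-independent mobility), and the thermal band width to scale with the square of the gel thickness d and the square of the field E, so that the thermal variance scales as (d²E²)². The coefficients `a_dif` and `b_heat` are hypothetical and in arbitrary units.

```python
def total_variance(field, a_dif, b_heat, thickness):
    """Toy total variance: diffusion term ~ 1/E plus heating term ~ (d^2 E^2)^2."""
    diffusion = a_dif / field
    heating = b_heat * (thickness ** 2 * field ** 2) ** 2
    return diffusion + heating

def optimal_field(a_dif, b_heat, thickness):
    """Field that minimizes total_variance.

    Setting d/dE (a/E + b d^4 E^4) = 0 gives E^5 = a / (4 b d^4).
    """
    return (a_dif / (4.0 * b_heat * thickness ** 4)) ** 0.2

# Hypothetical coefficients; only the trend with thickness is of interest.
for d in (0.4, 0.2, 0.1):
    print(d, optimal_field(1.0, 1.0, d))
```

In this model the optimum scales as d^(-4/5), so halving the gel thickness raises the optimal field by a factor of about 1.7. With a third-power dependence of the thermal term, as estimated by other authors, the exponent would differ but the qualitative trend would remain.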

In practice, the band width is also strongly de- 
pendent on the ‘quality’ of the gel and the loading. 
In order to achieve optimum results, any factor that 
could increase the band width has to be minimized 
or avoided. This includes for example impurities or 
air bubbles in the gel, badly formed wells, urea in the 
wells, etc. Depending on the gel concentration, the 
mobility of the sample is slowed down when 
entering the gel. Therefore it becomes concentrated 
and the initial band width is reduced, but irregular- 
ities on the surface of the gel are ‘imprinted’ into the 
band shape. 

In classical slab gel electrophoresis, it is a band- 
spacing effect that limits the readability. As ex- 
plained above, the mobility of the (non-orientated) 
DNA molecules is inversely proportional to their 
length. As the run time is the same for all fragments, 
the relationship between distance travelled and 
fragment size is a hyperbolic function, which means 
that there are widely spaced bands at the bottom and 
crowded bands at the top of the gel. In other words, 
the large fragments do not travel far enough to be 
well separated. It is this reduction in interband 
spacing that limits the readability of classical 
sequencing gels [15]. Possible solutions to this 
problem are the use of very long gels or the use of 
gradient gels (see below). 

However, there exists another solution to this 
problem. Instead of recording the band pattern in 
the whole gel at a fixed time, the passing bands can 
be detected at a fixed position in the gel over a long 
time period. As all fragments have to travel the same 
distance, the relation between retention time and 
fragment size becomes a linear function, and in 
consequence, constant interband or peak spacing is 
obtained in the nonorientated regime (e.g. ref. 21). 
This principle has been realized in direct blotting 
electrophoresis as well as in electrophoresis with 
on-line detection (automated sequencers, capillary
electrophoresis), and in this case it is the band
broadening that has been identified as the limiting 
factor [15]. 
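The contrast between the two read-out modes can be sketched numerically. The example assumes, as in the text, that mobility is inversely proportional to fragment length in the non-orientated regime; the constants `k` and `c` and the fragment sizes are hypothetical.

```python
def snapshot_distances(sizes, k):
    """Distance travelled at a fixed run time: proportional to 1/size (hyperbolic)."""
    return [k / n for n in sizes]

def arrival_times(sizes, c):
    """Time needed to reach a fixed detector position: proportional to size (linear)."""
    return [c * n for n in sizes]

def adjacent_spacings(values):
    """Absolute differences between neighbouring entries."""
    return [abs(b - a) for a, b in zip(values, values[1:])]

# Classical gel: spacing between fragments differing by one base.
print(adjacent_spacings(snapshot_distances([100, 101], k=4000.0)))
print(adjacent_spacings(snapshot_distances([400, 401], k=4000.0)))

# Fixed detector: the same pairs, now spaced in time.
print(adjacent_spacings(arrival_times([100, 101], c=1.0)))
print(adjacent_spacings(arrival_times([400, 401], c=1.0)))
```

In the snapshot mode the spacing falls roughly as 1/N², so the 400/401 pair is about 16 times closer together than the 100/101 pair, whereas in the fixed-detector mode both pairs are separated by the same time interval.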

If the band width or the reduction in interband 
spacing at the top of the gel could be minimized, the 
separation would eventually be limited by a zero 
velocity difference due to molecular orientation [22]. 
Recently, the use of high gel concentrations and high 
electric fields for the optimization of sequencing 
electrophoresis within that limit has been proposed 
[23]. 

Pulsed electric fields, which can successfully 
cancel the molecular orientation of large double- 
stranded DNA molecules in agarose gels and 
therefore expand the range of separation, have been 
shown to give only slight improvements in
polyacrylamide gels at usual field strengths [4,24-27];
however, they might be more effective at higher field 
gradients [15]. Trapping electrophoresis [14] has 
been proposed, but band broadening seems to be a 
major problem. 

So far, the idea of sequencing thousands of bases 
from a single sample on a polyacrylamide gel 
remains an illusion. The most promising way to 
enhance the amount of information obtained from a 
single gel seems to be the multiplex sequencing 
method [28] (see Chapter 20). However, enhancing 
the readability of a gel is still important, as it would 
reduce the number of clones needed in shotgun 
sequencing and would also reduce the amount of 
work needed if ordered strategies are employed. 

The factors limiting sequencing gel electrophore- 
sis are listed in Table 23.1 along with their remedies. 


23.2 Slab gel electrophoresis 


Slab gel electrophoresis is the ‘classical’ method for 
DNA sequencing. This process can be automated, 
and there are a few commercially available instru- 
ments on the market, all of them based on fluo- 
rescence labelling and detection. However, as these 
instruments are still rather expensive, the manual 
method is still in use in many laboratories (see 
Protocol 118). 


23.2.1 Manual sequencing 


23.2.1.1 Gel matrix 

In free solution, nucleic acids of different sizes have 
the same electrophoretic mobility and cannot be 
separated in an electric field. A matrix is therefore 
needed, which on one hand serves as an anticon- 
vective medium and on the other has to have good 
‘sieving’ properties. At present, polyacrylamide gels
are the only matrices which have a high enough
resolving power for DNA sequencing. The gel is
prepared by polymerizing acrylamide monomers in
the presence of a crosslinker, which forms covalent
bridges between the polyacrylamide chains, resulting
in a three-dimensional network (Fig. 23.2). There
are a number of different crosslinkers (see refs
29 and 30 for reviews), but N,N'-methylene-bis-acrylamide
(‘bis-acrylamide’) is by far the most popular.

Fig. 23.2 Structure of polyacrylamide.

Table 23.1 Limiting factors in sequencing gel electrophoresis.

Techniques compared: classical slab gel; DTE/gel readers; CE.

Limiting factor                                  Remedy
Loss of separation due to molecular orientation  Use pulsed fields;
                                                 use trapping electrophoresis
Band width, due to:
  Initial band width
  Diffusion
  Joule heat
Loss of separation due to reduced interband
  spacing at the top

CE, capillary electrophoresis.
** Main factor.
* Secondary factors.

The following terminology for describing gel 
composition is very useful and has been widely 
adopted: the gel concentration is given as the total 
monomer concentration (acrylamide plus cross- 
linker), for example ‘6%T’, whereas the crosslinker 
concentration is given in percentage of the total 
concentration (‘%C’).

The pore size of the gel is dependent on both the 
crosslinker and the total concentration, but from 
electron micrographs [31] and experimental studies 
[32,33] it is known that an acrylamide: bis- 
acrylamide ratio of 19:1 (i.e. 5%C) gives the smallest 
pore size and therefore the highest resolution for any 
given total concentration, %T [34]. 
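The %T/%C terminology amounts to simple arithmetic, sketched below. The function name and the recipe (5.7 g acrylamide plus 0.3 g bis-acrylamide in 100 ml) are illustrative assumptions; %T is taken here as grams of total monomer per 100 ml of gel solution.

```python
def gel_composition(acrylamide_g, bis_g, volume_ml):
    """Return (%T, %C) for a gel solution.

    %T: total monomer (acrylamide + crosslinker) in g per 100 ml.
    %C: crosslinker as a percentage of the total monomer.
    """
    total_g = acrylamide_g + bis_g
    percent_t = 100.0 * total_g / volume_ml
    percent_c = 100.0 * bis_g / total_g
    return percent_t, percent_c

# Hypothetical recipe with a 19:1 acrylamide:bis-acrylamide ratio.
print(gel_composition(5.7, 0.3, 100.0))
```

A 19:1 ratio always corresponds to 5%C, whatever the total concentration, which is why the two parameters can be quoted independently.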

The choice of the gel concentration is determined 
by the competition between the maximum inter- 
band spacing that can be obtained and the reduction 
of the spacing that occurs in classical slab gels, as 
described above. Concentrated gels (e.g. 8 or 10%T) 
have a high resolving power, but this is drastically 
reduced with increasing molecular weight (more 
than about 200 bases). Dilute gels (e.g. 4%T) can
separate much larger molecules, but at lower
resolution [34]. Therefore, for standard sequencing
gels, a concentration of 6%T is the best compromise.

To initiate the polymerization reaction, a catalyst 
that produces free radicals is needed. As for the 
crosslinker, there are a number of catalyst systems 
for polyacrylamide (see ref. 35 for review), but for 
DNA sequencing gels only one catalyst-redox 
system is generally in use: ammonium persulphate 
(APS), which acts as the radical donor, in combination
with N,N,N',N'-tetramethyl-1,2-diaminoethane
(TEMED) as the catalyst.

After addition of the initiator, gelation should take 
place within 20-30 min; the time can be tested using 
an aliquot of the gel solution in a closed vessel (i.e. 
reaction tube). However, as the polymerization 
reaction continues, it is better to prepare the gel well 
before the start of the electrophoresis (for the 
dynamics of the polymerization of acrylamide, see, 
e.g. ref. 36). If a sequencing run is to be done within 
one day, we recommend preparing the gel first thing 
in the morning. It will then be ready for use when 
the sequencing reactions are completed (around 
2-3 h in the case of enzymatic sequencing starting
from a purified template). 

As oxygen is a ‘trap’ for free radicals, it inhibits the 
polymerization reaction. Therefore, in order to 
achieve higher reproducibility, we recommend that 
the gel solution be degassed prior to pouring. Care 
has to be taken that the gel edges are not exposed to
the air; that is, proper sealing of the glass plates and a
well-fitting well-former (comb) are essential.

All substances mentioned above have to be 
handled with care, as acrylamide and bis-acrylamide
are toxic and are absorbed through the skin.
TEMED is corrosive and APS is a strong oxidizing 
agent. Both catalysts are hygroscopic and have only 
a limited shelf life. 

Reagents should be of the highest quality and 
purity. Poor quality acrylamide and bis-acrylamide 
can contain the following: 

1 Acrylic acid, the hydrolysis product of acry- 
lamide, will polymerize with acrylamide or bis- 
acrylamide, and therefore change the properties of 
the gel. The degradation reaction of acrylamide and 
bis-acrylamide is catalysed by light; storage in the 
dark is therefore recommended. 

2 Linear polyacrylamide, which is caused by 
catalytic contaminations in the dry monomer, will 
affect the polymerization of the gel and the effective 
acrylamide concentration, leading to loss of repro- 
ducibility. 

3 Metal ions as contaminants can inhibit or acceler- 
ate the polymerization or affect the mobility of the 
DNA. 

APS is very hygroscopic and decomposes almost
immediately when dissolved in water. The result is
loss of activity. As this compound affects the rate of
polymerization, it is important to prepare the solution
fresh daily in order to achieve reproducible results.
Alternatively, aliquots of a stock solution can be frozen,
and each aliquot discarded after use.

TEMED is very reactive and subject to oxidation. 
The oxidized form is yellow and less reactive. As 
TEMED is also hygroscopic, it will accumulate 
water, which again accelerates the oxidative decom- 
position. Therefore, only water-free TEMED, greater 
than 99% pure, should be used. 

Recently, new monomers, such as N,N-dimethylacrylamide
(DMA) and similar alkyl-substituted
acrylamides, have been introduced (see
refs 37 and 38 for reviews). A commercial product 
containing these formulations (HydroLink™) has 
been shown to be useful for DNA sequencing [39]. 
The authors claim an increase in readability. 

Another novel monomer, N-acryloylamino- 
ethoxyethanol (AAEE) has been synthesized, which 
is much more hydrophilic than DMA and therefore 
better suited for serving as an electrophoretic matrix 
[38]. Both poly(AAEE) and poly(DMA) have much 
better resistance to hydrolysis under both acidic and 
alkaline conditions than polyacrylamide [37,38], 
which would allow the use of buffers with higher 
pH, which would in turn help keep the DNA fully 
denatured during electrophoresis. 


23.2.1.2 Buffer 
For DNA sequencing gels, 1x TBE buffer (89 mM
Tris, 89 mM boric acid, 2.5 mM EDTA, pH 8.3) is
normally used. However, a precipitate can form
during prolonged storage of concentrated stock
solutions. The remedy is to use a modified TBE
(133 mM Tris, 44 mM boric acid, 2.5 mM EDTA,
pH 8.8), which does not precipitate in concentrated
form. Such a buffer with a higher pH has been 
reported to give a better resolution [34]. 
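For preparing stock solutions, the concentrations above translate into weights as sketched below. This is only an arithmetic aid under stated assumptions: the molar masses are standard values that do not come from the text, the EDTA is assumed to be the disodium salt dihydrate (the text does not specify the form), and the pH still has to be checked after dissolving.

```python
# Molar masses in g/mol (standard values; the EDTA form is an assumption).
MOLAR_MASS = {
    "Tris": 121.14,
    "boric acid": 61.83,
    "EDTA disodium dihydrate": 372.24,
}

# 1x working concentrations in mM, as given in the text.
STANDARD_TBE = {"Tris": 89.0, "boric acid": 89.0, "EDTA disodium dihydrate": 2.5}
MODIFIED_TBE = {"Tris": 133.0, "boric acid": 44.0, "EDTA disodium dihydrate": 2.5}

def grams_per_litre(conc_mm, molar_mass, fold=1.0):
    """Grams needed per litre for a `fold`-times concentrated stock."""
    return conc_mm / 1000.0 * fold * molar_mass

def recipe(buffer_mm, fold=1.0):
    """Return {component: grams per litre} for the given buffer."""
    return {name: grams_per_litre(c, MOLAR_MASS[name], fold)
            for name, c in buffer_mm.items()}

# Hypothetical use: weights for one litre of 10x stock of each buffer.
print(recipe(STANDARD_TBE, fold=10.0))
print(recipe(MODIFIED_TBE, fold=10.0))
```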

Again, only high-quality reagents with a low
content of metal or non-buffer ion contaminants
should be used.


23.2.1.3 Denaturants 
Intramolecular base pairing can occur within a DNA
strand, resulting in the formation of small loop 
structures within the molecule. As such confor- 
mations can alter the mobility of DNA fragments, a 
molecule with intramolecular base pairing can 
travel at the same speed as a shorter one without a 
loop. On sequencing gels, this effect manifests as a 
‘compression’, where the band separation is drasti- 
cally reduced because of the mobility change. 

In order to avoid such intramolecular base 
pairing, the gel has to be run under denaturing 
conditions. Therefore, a denaturing agent, such as
urea or formamide, has to be added during polymerization.
Alkali cannot be used as it deaminates
acrylamide, and methylmercuric hydroxide inhibits
polymerization. Most researchers use urea (at a
concentration of 7-8 M) as it does not have to be
deionized, as do most batches of formamide.
However, the addition of formamide (up to 40%) 
increases the denaturing capacity of the gel. 

The denaturing power of these agents alone is not 
sufficient and the gels have to be run at an elevated 
temperature. Generally, a temperature of 50-60 °C is 
high enough to keep the DNA fragments denatured. 


23.2.1.4 Gel dimensions and apparatus 

The gel is poured between two glass plates, one of 
them being notched to ensure contact with the 
buffer. Its thickness is determined by the thin plastic 
strips that are used as spacers between the glass 
plates. Standard sequencing gels are 0.3-0.4 mm
thick, but the use of thin (0.1-0.2 mm) and ultrathin 
(0.05 mm) gels has been reported [34,40]. Pouring of 
these gels requires special methods, such as the 
sliding technique [41], or the clapping technique 
[42,43]. Thinner gels generate less Joule heat and can 
therefore be run at higher voltages, resulting in 
lower run times and less diffusion. The temperature 
gradient across the gel is smaller, also causing less 
band spreading. However, thin gels are very fragile, 
and sample loading is difficult, whereas thicker gels 
accept larger sample volumes, but take longer to fix 
and dry. 

As the distance between bands increases linearly 
during the run and band broadening only increases 
with the square root of time, the number of resolved 
bases should increase with longer runs. To prevent 
the short fragments running out of the gel, long gels 
are required and indeed such gels (100-120 cm) have
been reported to give a resolution up to 600 bases 
[34]. However, large gels are difficult to handle and 
the gain of information often does not justify the 
effort. The ‘standard’ gel length of 40-50 cm is much 
more convenient and matches the common size of 
gel dryers, X-ray films and cassettes. 
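The scaling argument for long runs can be written out as follows. This is an idealized sketch which assumes that diffusion is the only source of band broadening and that the fragments stay in the non-orientated regime; the proportionality constants are left unspecified.

```latex
\Delta X \propto t, \qquad W_b \propto \sqrt{t}
\quad\Longrightarrow\quad
R = \frac{2\,\Delta X}{W_{b1} + W_{b2}} \propto \frac{t}{\sqrt{t}} = \sqrt{t}
```

Under these assumptions, doubling the run time (and thus the gel length needed to retain the short fragments) improves the resolution only by a factor of about 1.4, which is consistent with the modest gain reported for very long gels.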

The gel width is very much a matter of personal 
preference, but should also be chosen according to 
the film size (e.g. 20 or 40 cm). As the very edges of a
gel should not be used because of the temperature
gradient, the actual loading width of a gel 20 cm
wide is about 15-16 cm, which is enough to load 40 
samples (10 clones). 

There are a number of devices for manual 
sequencing with different features. We obtained
very good results with a simple two-tank design as
described in ref. 44.


23.2.1.5 Sample wells and loading

The slots that accommodate the sample are formed 
by introducing a comb at the top of the gel 
immediately after pouring. The size and number of 
the wells is again a matter of personal preference. A 
comb with teeth 2mm wide and 3mm deep and a 
space of 1 mm between them, for example, allows at 
least 10 clones to be loaded on a gel 20 cm wide. 
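The lane count follows from the comb geometry. The sketch below is an illustration with hypothetical function names; it assumes that each tooth forms one well, that four lanes are needed per clone (one per termination reaction), and that the usable loading width of a 20 cm gel is about 15 cm, as stated earlier.

```python
def lanes_for_comb(loading_width_mm, tooth_mm, gap_mm):
    """Largest n such that n teeth and (n - 1) gaps fit into the loading width."""
    return int((loading_width_mm + gap_mm) // (tooth_mm + gap_mm))

def clones_per_gel(lanes, lanes_per_clone=4):
    """Number of complete clones (four lanes each by default)."""
    return lanes // lanes_per_clone

lanes = lanes_for_comb(150, 2, 1)  # 2 mm wells, 1 mm apart, 150 mm usable width
print(lanes, clones_per_gel(lanes))
```

With these numbers the geometry allows somewhat more than the 10 clones quoted, which leaves room for marker lanes or unused wells at the edges.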

Another type of slot former is the ‘sharkstooth’ 
comb. The gel is polymerized with a precomb (a 
rectangular piece of spacer material) in place, to 
ensure a flat surface. After polymerization, the 
precomb is replaced by the sharkstooth comb so that 
its teeth, which serve as barriers between the 
samples, just penetrate the gel. The comb remains
in place during electrophoresis.

Sharkstooth combs have the advantage that the 
sample tracks are immediately adjacent, making gel 
reading easier. On the other hand, a perfectly flat 
surface is essential, and introducing the comb needs 
some experience in order to avoid leaks or defor- 
mations of the surface. 

In both cases, perfect polymerization is important.
Therefore, exposure to air should be avoided,
and the combs should be absolutely clean. A ‘trick’
to ensure proper polymerization at the wells 
consists of applying tiny amounts of the APS 
solution onto the comb with a paper tissue, just 
before inserting into the gel (R. Reinhard, personal 
communication 1995). 

Before loading, the wells must be flushed to 
remove unpolymerized acrylamide and urea which 
diffuses out of the gel. The (denatured) samples can 
be loaded with the help of a glass capillary or special 
thin pipette tips. One has to keep in mind that the 
final width of the bands in the gel is also dependent 
on the initial band width and that the concentration 
effect taking place at the gel surface might not be 
strong enough to compensate for dilute samples. 
Therefore, only small volumes should be loaded and 
as quickly as possible to minimize diffusion (and 
renaturation). Properly washed wells with an even 
surface are essential. Loading the sample directly 
onto the surface is better than letting it trickle down 
the well, which causes dilution with the buffer. 


23.2.1.6 Field conditions 

As pointed out above, the influence of the electric 
field on the resolution is severalfold. The optimum 
field strength that gives the minimum band width 
depends on different factors, but increases with 
decreasing gel thickness, which is why in thin gels 
higher electric fields can be used. The empirically 
determined optimum field strength of 30-50 V cm⁻¹
[34] coincides well with the estimated one [19].

If the gel is not actively thermostatted, the electric
power also has the task of heating the gel. An electric 
field should be chosen which generates enough 
power to bring the gel temperature to 50-60°C 
(measured on the outside). Too strong an electric 
field, however, will result in overheating, which in 
turn enhances the conductivity of the gel. Current 
and power will increase, leading to even more 
heating and so forth (‘thermal runaway’). Therefore, 
to ensure a constant heat input, the gel should be run 
at constant power rather than at constant voltage. 
For routine sequencing, an active temperature 
control is not necessary, but we recommend the use 
of a metal plate in contact with one of the glass 
plates, to achieve a more even heat distribution. 

For a 20 × 48 cm gel 0.4 mm thick, a power of 40 W
is sufficient to generate the required temperature.
This results in a current of about 20-25 mA and a
voltage of about 1.6-2 kV, corresponding to an
electric field of 33-42 V cm⁻¹. Under these conditions, molecular
orientation does not occur below fragment lengths 
of 1 kb, which is far above the actual reading limit. 
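The figures quoted above are related by elementary arithmetic (field = voltage / gel length, current = power / voltage), which the following sketch reproduces. The function names are illustrative, and the gel is treated as a simple resistive load.

```python
def field_strength_v_per_cm(voltage_v, gel_length_cm):
    """Mean electric field along the gel."""
    return voltage_v / gel_length_cm

def current_ma(power_w, voltage_v):
    """Current drawn at a given power and voltage, in mA."""
    return 1000.0 * power_w / voltage_v

# Values from the text: 48 cm gel, 40 W, 1.6-2 kV.
for voltage in (1600.0, 2000.0):
    print(voltage,
          round(field_strength_v_per_cm(voltage, 48.0), 1),
          round(current_ma(40.0, voltage), 1))
```

At constant power the voltage and current drift in opposite directions as the gel warms up and its conductivity rises, so the field settles somewhere within the range quoted.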


23.2.1.7 Gradient gels 

As explained above, for resolving large fragments, 
long gels have to be used in order not to lose the
small molecules. An alternative method is to slow 
down the fragments in the lower part of the gel, thus 
preventing them from being electrophoresed out of 
the gel. This can be achieved with a nonuniform 
electric field across the gel, that is, the field in the
lower part has to be weaker than in the upper part. 

One way of creating such a gradient is by
preparing a wedge-shaped gel, which is thicker at
the bottom (0.6-0.75 mm) than at the top (0.25 mm)
[45-47]. In the thicker part, the gel has a lower 
resistance, leading to a lower voltage drop across 
that part of the gel. 
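The field profile in a wedge gel can be estimated with a simple series-resistance picture. The sketch below is an assumption, not a model taken from refs 45-47: it supposes uniform conductivity, constant gel width, a linear increase in thickness from top to bottom, and no heating effects, so that the same current flows through every cross-section and the local field is inversely proportional to the local thickness. The applied voltage and gel length in the example are hypothetical.

```python
import math

def wedge_field(x_cm, voltage_v, length_cm, d_top_mm, d_bottom_mm):
    """Local field (V/cm) at distance x_cm from the top of a wedge gel.

    Assumes uniform conductivity, constant width, a linear thickness
    profile and d_bottom_mm != d_top_mm.
    """
    thickness = d_top_mm + (d_bottom_mm - d_top_mm) * x_cm / length_cm
    # Integral of 1/thickness along the gel; it normalizes the field so
    # that the total voltage drop equals voltage_v.
    norm = length_cm * math.log(d_bottom_mm / d_top_mm) / (d_bottom_mm - d_top_mm)
    return voltage_v / (thickness * norm)

# Hypothetical run: 1800 V across a 48 cm gel, 0.25 mm (top) to 0.75 mm (bottom).
top = wedge_field(0.0, 1800.0, 48.0, 0.25, 0.75)
bottom = wedge_field(48.0, 1800.0, 48.0, 0.25, 0.75)
print(round(top, 1), round(bottom, 1), round(top / bottom, 2))
```

In this picture a threefold increase in thickness gives a threefold weaker field at the bottom, independent of the applied voltage; the achievable gradient is therefore set entirely by the spacer geometry.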

However, there is another way to produce a 
voltage gradient. Increasing the ionic strength by 
increasing the buffer concentration in the lower part 
of the gel also results in a lower resistance and 
therefore a lower voltage drop [48]. Such a buffer 
concentration gradient can be made with two gel 
solutions containing different TBE concentrations 
(e.g. 0.5 and 5x TBE, see Fig. 23.3a). 

A similar effect can be achieved by adding sodium
acetate to the lower chamber buffer. The salt diffuses
into the lower part of the gel, thus increasing the
ionic strength [49].

Fig. 23.3 Schematic illustration showing the principles of
the different electrophoretic techniques described in the
text. (a) Slab gel with gradient profile; (b) direct transfer
electrophoresis; (c) automated gel reader; (d) capillary
electrophoresis. W, detection window; HV, high voltage
power supply.

Due to the field gradient, the DNA molecules are
gradually slowed down, i.e. the bands are com- 
pressed and band width and interband spacing 
are reduced. If the gradient is carefully chosen, these 
gels give a nearly even band spacing, thus making 
gel reading easier. Because of these advantages, the 
use of a gradient gel is strongly recommended. 
Which kind of gradient is used remains a matter of 
personal choice. Wedge-shaped gels can be prepared 
with a single gel solution, but they take much longer 
to dry. Without distorting the glass plates, the 
magnitude and flexibility of the gradient remains 
limited. In buffer gradients, on the other hand, 
uniformity across the gel can be difficult to obtain 
and distortion of the lanes can occur. See also 
Chapter 19, Protocol 99 for the preparation of a 
denaturing gradient gel. 


23.2.1.8 End of the run and autoradiography 

In a 6% gel, the bromophenol blue marker runs at
approximately the same rate as a DNA fragment
25 nucleotides long, and the xylene cyanol
marker at the same rate as a fragment of about 115
nucleotides. Depending on the vector and the
primer used, the start of the unknown sequence is
about 50 nucleotides from the 3'-end of the primer.
Under the conditions described above (20 × 48 ×
0.04 cm buffer gradient gel, 40 W power input),
molecules of this length will reach the end of the gel
after about 4 h.

In classical sequencing gels, radioactive labels are 
used. To detect the bands, an X-ray film is placed in 
close contact to the gel (exposure) and processed 
afterwards. In order to obtain sharp bands, a low-
energy radioactive source, such as ³⁵S, is preferred
(³²P generally gives more diffuse bands). To prevent
quenching of the weak signal, urea has to be washed 
out and the gel has to be dried. 


23.2.1.9 Blotting 

There are some applications where it is necessary to 
transfer the DNA fragments from the gel onto a 
membrane (blotting). These include the nonradio- 
active detection of the sequencing pattern with 
colorimetric [42,50] or chemiluminescent methods 
[51,52] as well as multiplex sequencing [28] (see 
Chapter 20). In the case of sequencing gels, there are
two main approaches. Electroblotting and
capillary blotting are ‘off-line’ methods, which
means that after electrophoresis, the gel has to be
dismantled and the fragments transferred to a
membrane with the help of capillary forces or a
transverse electric field. Direct blotting, or direct
transfer electrophoresis (DTE), is an ‘on-line’ method:
a membrane is moved across the bottom of the
sequencing gel, and the DNA molecules are immobilized
onto that matrix as they are eluted (Fig. 23.3b)
(see Protocol 119).

After the transfer, the membranes can be used for 
hybridization (multiplex sequencing) or can be 
developed to visualize the DNA fragments. If blot- 
ting is used for routine nonradioactive sequencing, 
it might be a good idea to try to automate the 
developing step (see ref. 42, for example) as this can 
become very time consuming. 


Electroblotting and capillary blotting Sequencing gels 
can be blotted in the same way as agarose gels by 
using capillary forces [53]. Electroblotting is much 
faster, but unfortunately, most commercially avail- 
able blotting devices have been built for blotting 
standard agarose or protein gels and are too small to 
accommodate a large sequencing gel. Therefore, 
those interested in this technique might be obliged 
to construct an apparatus on their own. (The 
construction and use of such a device is described in 
ref. 54.) 

The principle is essentially the same as for blotting 
agarose or SDS-polyacrylamide gels; however, 
some experience and extra care is needed in order to 
avoid air bubbles being trapped between the gel and 
the membrane. 

We have successfully used a home-made ‘wet 
blotting’ device. With 0.25x TBE transfer buffer and 
an electric field of about 50 V cm⁻¹, transfer is
complete within 15-20 min. 


Direct transfer electrophoresis More than 10 years ago, 
a new technique was introduced that combines the 
separation and the blotting into a single step. This 
direct blotting electrophoresis or direct transfer 
electrophoresis (DTE) has proved useful for DNA 
sequencing [21,42,55] and the device is also com- 
mercially available. DTE allows nonradioactive 
sequencing at relatively low cost, but some experi- 
ence is needed in order to achieve good results. In 
DTE, the DNA fragments travel the same distance 
but over different time spans (in contrast to normal 
gels); thus they are evenly spaced on the membrane 
over a wide size range [21]. As explained above, in 
this case band spreading is the limiting factor, 
therefore the following strategy should be adopted: 
thin gels (0.1 mm) should be used, as they minimize
the band width. The loading capacity is reduced, 
however, which can be alleviated by the use of 
‘inverse wedge’ gels [42]. An air-bubble-free, 
homogeneous gel with even edges in the wells, as 
well as a clean and even lower edge is also very 
important. The gel is bound to one glass plate with
3-methacryloxypropyltrimethoxysilane [41] to avoid
slipping. Longer gels increase the resolution, but
lead to long run times. As the gels are thinner, 
stronger electric fields can be used to compensate, 
but this might eventually lead to a domination of the 
molecular orientation. 

Good results have been obtained with 30-cm long
gels and a voltage of 1800 V (60 V cm⁻¹) [42,55].
The optimum speed of the membrane is then
8-20 cm h⁻¹. Further improvements can be made by
following the theoretical considerations outlined
above: by using longer gels (60 cm) and higher
electric fields (70 V cm⁻¹), but lower gel concentrations
(3.5%), sequence readings up to 800
bases have been achieved [43]. When even higher
electric fields (100 V cm⁻¹) are applied, the distances
between the bands of the long fragments are 
reduced (Fig. 8 of ref. 43), which shows the onset of 
molecular orientation and indicates that the im- 
provements have pushed the method close to its 
inherent limits. 


23.2.2 Automated sequencing gel electrophoresis 


In the past few years automated DNA sequencers 
have been developed [56-61]. This name is 
somewhat misleading, as in fact these devices are 
‘on-line’ gel readers. The principle of slab gel 
electrophoresis remains the same and the gels still 
have to be poured and loaded manually. Therefore,
what has been said above remains valid; however,
more care has to be taken when preparing the gel
and the gel solutions, as dirt, dust and fluorescent 
contaminants in the gel or on the glass plates can 
disturb the detection (see Protocol 120). 

The existing devices are based on fluorescence 
detection. They require reaction products with a 
fluorescent group attached either to the primer, to 
the dideoxy nucleotide analogues or to the deoxy- 
nucleotides. The products are detected directly 
within the gel, by excitation of the fluorescent 
labels with a laser and detection of the signals 
emitted at the respective wavelengths. The laser
beam can either be directed onto the gel surface (e.g.
ref. 56) or enter the gel from the side (‘side
excitation’) [59,62]. During electrophoresis, the bands
pass the laser and detector, which have a fixed vertical
position, leading to ‘on-line’ and ‘real-time’ detection (see
Fig. 23.3c). The bands that have passed the detector 
are no longer needed as they have already been 
processed, and are electrophoresed off the gel. 

For labelling and detection, different strategies 
can be adopted: if different fluorescent tags are used 
for each of the four terminator reactions, all four 
reaction sets can be electrophoresed in the same 
lane. This avoids problems due to possible mobility 


variations in different lanes. However, the different 
fluorophores change the mobility of the DNA 
fragments and the different shifts have to be cor- 
rected by a computer program. 

Alternatively, a ‘one-dye/four-lane’ approach can 
be used, which avoids the need of four different 
labels, but problems of misalignment between lanes 
can occur and computational algorithms are needed 
to compensate [63]. Of course, fewer samples can be
loaded on a gel of the same size. The advantages and
disadvantages of the different strategies are discussed
in detail in ref. 64.

As in DTE, all DNA bands travel the same 
distance but over increasing time spans in propor- 
tion to their size and pass the detector at regular time 
intervals. Again, the band width is the limiting 
factor and all means to reduce band broadening will 
enhance the performance of sequencers with on-line 
detection. 

However, the first generation of commercially
available automated sequencers did not seem to
follow these rules and, unsurprisingly, their performance
was no better than that of manual sequencing (e.g.
ref. 3). This is mainly due to technical limitations:
to obtain a stronger signal, relatively thick gels 
(0.3-0.5mm) are used. Detection systems that scan 
the gel have a long data acquisition time. In order to 
be identified, the DNA fragments have to travel at 
low speed, which means long run times and 
enhanced band width by diffusion. 

Recently, theoretical considerations concerning 
resolution have been confirmed by empirical studies 
(e.g. ref. 19) and instrumental improvements have 
been made. These include the use of thin (0.1-
0.25 mm) or long (50-90 cm) gels, stronger electric
fields (up to 80 V cm⁻¹), reduced laser beam diameter,
faster detection systems (e.g. without filter
wheels) or simultaneous detection of all lanes 
[5,65-67]. These measures improve resolution and/ 
or speed and therefore the throughput of obtainable 
sequence data. 

Automated sequencers are not suitable for 
multiplexing, unless a sophisticated, multispectral 
labelling technique is developed. 


23.3 Capillary electrophoresis 


During the past 10 years, capillary electrophoresis 
has been developed into a powerful analytical 
method. Separation takes place in a thin fused silica 
capillary, coated with polyimide on the outside. A 
small window in the coating allows detection by 
absorption or fluorescence (Fig. 23.3d). 

Capillary electrophoresis can also be used to 
separate oligonucleotides and DNA sequencing 
reaction products [68-72]. For labelling and detection,
the same sequencing chemistries can be used as
for the automated slab gels (ref. 73; see ref. 74 for 
review). However, despite a number of publications 
concerning this technique it has, to our knowledge, 
not yet been used for a larger sequencing project. 
One obvious reason for the low acceptance of this 
method for DNA sequencing was the lack of a 
suitable commercially available apparatus. How- 
ever, in the meantime, several prototypes have been 
described [75-77], and at the time of writing, one
device, specialized for DNA sequencing and 
analysis (based on a single capillary system) is on 
the market. There have also been some technical 
problems like the filling of the capillary and the 
stability of the gel [78]. However, the field is 
developing fast, and capillary electrophoresis might 
soon become a real alternative to the conventional 
method (see Protocol 121). 

In principle, capillary electrophoresis has several 
advantages over slab gel electrophoresis. Because of 
the small diameter of the capillaries, heat dissipation 
is very effective and band broadening due to Joule 
heating is minimized. Strong electric fields (up to 
400 V cm⁻¹) can be used, therefore reducing run time
and diffusion. Capillaries are available in a variety of
diameters (about 10-300 µm) and their length can be
chosen in a wide range. The sensitivity is very high 
and minute amounts of sample can be analysed, 
which probably could make the amplification of the 
template (in vivo or in vitro) unnecessary and offer 
the opportunity of sequencing DNA directly 
isolated from plaques or colonies. 

The biggest advantage, however, is the potential 
for full automation of the separation process for 
DNA sequencing. The samples are injected by pres- 
sure or electrokinetic injection. After each run, the 
separation matrix in the capillary can be replaced by 
rinsing with pressure. Both processes, injection and 
rinsing, can be done automatically, therefore avoid- 
ing the time-consuming tasks of gel pouring and gel 
loading. The commercially available instrument, for 
instance, has a capacity of 48 samples, which can be 
analysed in a fully automated way.

However, in contrast to slab gels, only one sample 
can be loaded at a time, which means a total analysis
time of about 125 h (140 min separation time plus 15 min
for refilling and prerun per sample) for the 48 samples. This is
still much longer than the ~10h (2h for gel 
preparation and loading, 8 h for prerun and run),
needed for the separation of 36 samples in an 
automated slab gel sequencer. 
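
The comparison above is simple arithmetic and can be reproduced as a short sketch. The Python below is an illustration only: the function name is ours, and it assumes that the 15 min for refilling and prerun is spent once per sample, which is the reading that reproduces the quoted total of about 125 h.

```python
import math

def total_run_hours(n_samples, minutes_per_load, samples_per_load=1):
    """Total instrument time in hours when samples are processed in successive loads."""
    loads = math.ceil(n_samples / samples_per_load)
    return loads * minutes_per_load / 60.0

# Single capillary: 140 min separation + 15 min refilling/prerun for every sample
ce_hours = total_run_hours(48, 140 + 15)

# Slab gel: 2 h preparation and loading + 8 h prerun and run, 36 lanes in one load
slab_hours = total_run_hours(36, (2 + 8) * 60, samples_per_load=36)

ce_rate = 48 / ce_hours        # samples per hour
slab_rate = 36 / slab_hours    # samples per hour

print(f"capillary: {ce_hours:.0f} h for 48 samples ({ce_rate:.2f} samples/h)")
print(f"slab gel:  {slab_hours:.0f} h for 36 samples ({slab_rate:.2f} samples/h)")
```

On these figures the single capillary processes well under one sample per hour, whereas the slab gel processes several.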

Therefore it will only be possible to exploit the 
full potential of capillary electrophoresis for DNA 
sequencing if many capillaries in parallel (capillary
arrays) are used. Several prototypes have been
described [79-83], but a ‘ready to use’ system is not 
yet on the market. These prototypes mainly differ in 
the way the DNA is detected (sheath flow cuvette, 
confocal system or direct observation), as this is not 
a simple task: the detection must be sensitive and 
fast at the same time. 
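
As a rough illustration of why arrays are needed, the following sketch estimates how many capillaries would have to run in parallel to match the sample throughput of a slab gel. It assumes the figures quoted earlier in this section (155 min per capillary cycle, 36 samples in about 10 h on a slab gel) and that all capillaries of an array run simultaneously with the same cycle time. The function name is hypothetical.

```python
import math

def capillaries_to_match(target_samples_per_hour, minutes_per_capillary_cycle):
    """Smallest number of parallel capillaries reaching the target throughput."""
    per_capillary = 60.0 / minutes_per_capillary_cycle   # samples per hour
    return math.ceil(target_samples_per_hour / per_capillary)

slab_rate = 36 / 10.0                      # samples per hour on a slab gel
needed = capillaries_to_match(slab_rate, 140 + 15)
print(f"capillaries needed to match the slab gel: {needed}")
```

Under these assumptions, an array needs on the order of ten capillaries merely to match a slab gel, and more to gain a clear advantage.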

Because of the reduced band width, capillary 
electrophoresis has potentially a very high separa- 
tion efficiency (several million theoretical plates per 
meter) and we should expect a better readability 
compared to the automated slab gel systems. So far, 
this has not been achieved. This could be owing to 
the molecular orientation, which becomes a limiting 
factor for separation at high field strengths [5], or to 
enhanced diffusion, which has been predicted to 
occur above a certain molecular size [16]. 


23.3.1 Capillaries 


In capillary electrophoresis (CE), columns made of 
fused silica are used. The surface of untreated fused 
silica is comprised of silanol groups, which are 
negatively charged at any pH above 2. These fixed 
charges are balanced by positive ions in the bulk 
solution, which form a thin sheet of charged fluid 
close to the capillary wall. When an electric field is 
applied, these positively charged ions will move 
towards the cathode, dragging with them the bulk 
solution. This flow is called electro-osmotic flow 
(EOF) and in many applications is used as a ‘pump’ 
for the separation process. In CE separation of DNA, 
the capillary is filled with a polymeric matrix, which 
suppresses this EOF to a certain extent. Therefore a 
coating of the inner capillary surface (to suppress 
the EOF) is not necessarily needed, but it has been 
found that in many cases such a coating is advan- 
tageous for reproducibility and quality of the 
separation. The coatings often consist of polymers 
which are adsorbed or covalently bonded (see ref. 84
for review). A number of coatings (e.g. methylsilicone,
phenyl, polyethylene glycol, trifluoropropyl,
polyacrylamide or polyvinylalcohol) are commer- 
cially available and have been successfully used in 
the separation of DNA. 


23.3.2 Gel matrix 


For separating oligonucleotides and sequencing 
reaction products, the same matrix as in slab gels (i.e. 
crosslinked polyacrylamide) is used. The gels are 
prepared in the same manner by adding the 
catalysts to the monomer solution which is then 
pumped into the capillary, where the polymeriza- 
tion takes place. The capillary is treated before with 
3-methacryloxypropyltrimethoxysilane, which fixes
the gel to the inner wall and prevents it from being 
extruded by electro-osmotic forces. Gel concentra- 
tions vary from 3 to 6%T and 3-5%C, and the same 
buffer and denaturants as in slab gels are used. 
Prefilled capillaries have also become commercially 
available. 
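
The %T and %C values can be computed from the weighed monomers. The sketch below uses the usual definitions (%T is the total monomer mass in grams per 100 ml; %C is the crosslinker as a percentage of the total monomer mass), which are assumed here rather than stated in this chapter. The function names and the 10 ml batch volume are illustrative.

```python
def percent_t(acrylamide_g, bis_g, volume_ml):
    """Total monomer concentration: g of acrylamide + bis per 100 ml."""
    return 100.0 * (acrylamide_g + bis_g) / volume_ml

def percent_c(acrylamide_g, bis_g):
    """Crosslinker (bis-acrylamide) as a percentage of total monomer mass."""
    return 100.0 * bis_g / (acrylamide_g + bis_g)

def monomer_masses(t_percent, c_percent, volume_ml):
    """Grams of (acrylamide, bis) needed for a gel of the given %T and %C."""
    total_g = t_percent * volume_ml / 100.0
    bis_g = total_g * c_percent / 100.0
    return total_g - bis_g, bis_g

# The 40% stock of Protocol 118: 380 g acrylamide + 20 g bis-acrylamide per litre
stock_t = percent_t(380, 20, 1000)
stock_c = percent_c(380, 20)

# A 4%T, 5%C gel in an assumed 10 ml batch
acrylamide_g, bis_g = monomer_masses(4, 5, 10)
print(stock_t, stock_c, round(acrylamide_g, 3), round(bis_g, 3))
```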

However, gel-filled capillaries have several dis- 
advantages. First, the capillary has to be filled 
extremely carefully in order to avoid introducing air 
bubbles. Shrinkage of the gel during polymerization 
can also be a source of bubbles [85]. It has been 
observed that during repeated use, bubbles can form 
at the sample-injection end of the capillary [78,86]. 
Drying out of the gel at the ends can also be a 
problem [78]. However, when carefully handled, 
gel-filled capillaries can be re-used several times, but 
about 50 separations seem to be the maximum. 

A solution to these problems is to replace the rigid 
nonflowable gel with a flowable network, which in
turn requires noncovalent crosslinks. One such type of
interaction is polymer entanglement: if a
solution is sufficiently concentrated, the polymer 
chains become entangled, forming a transient 
network that has good ‘sieving’ properties. The first 
separations of short DNA molecules were done in 
polymer solutions such as linear (uncrosslinked) 
polyacrylamide. For separating oligonucleotides and 
sequencing reaction products, solutions of about 
8-10% polyacrylamide are needed [75,87,88]. How- 
ever, such solutions are highly viscous and the capil- 
laries cannot be filled manually or with the existing 
capillary electrophoresis devices [89]. Therefore, as 
in the case of crosslinked gels, polymerization has to 
take place within the capillary. Again, as the capil- 
laries cannot be rinsed or refilled, they have a rather 
limited life and only electrokinetic injection of the 
sample can be used. Recently, the use of polyacry- 
lamide with low molecular weight has been pro- 
posed to circumvent the problem [76,90,91]. These 
formulations contain polymer chains that are small 
enough to give a low viscosity, but long enough to be 
still entangled. Besides polyacrylamide, other polymers
such as polyethylene oxide (PEO) have been
successfully used for DNA sequencing [92]. 

Another type of noncovalent crosslinks are 
hydrophobic associations. These can be obtained if 
hydrophobic end groups are attached to hydrophilic 
polymer backbones. When such copolymers are 
dissolved in water, the hydrophobic ends will asso- 
ciate into micelles. Above a certain concentration, 
these micelles will form continuous superstructures 
(‘self-assembling gels’). The mesh size and viscosity 
of such networks can be influenced by choosing 
appropriate hydrophilic backbones and different
hydrophobic end groups. Such a flowable gel is
successfully used as a matrix for separating DNA 
sequencing fragments in capillary electrophoresis 
[93]. 


23.3.3 Field conditions 


For DNA sequencing, electric fields between 100 and 
465 V cm⁻¹ have been used. Apparently, the readability
does not change very much with the electric
field, but seems to get worse above 400 V cm⁻¹. These
voltages were probably used to obtain a high speed 
rather than long readings. 

Clearly, at such high electric fields, the loss of 
separation due to molecular orientation will occur 
very early, but the exact limit is not yet known. 
Again, there is a trade-off between the reduction of 
diffusional band broadening and the reduction of 
the thermal gradient. The optimal electric field 
strength is also dependent on the fragment size, but 
a value of about 150-250 V cm⁻¹ has been found to be
a good compromise [94,95]. Under these conditions, 
up to 300-350 bases can be read in run times of only 
30-60 min (e.g. refs 72,76) and up to 450 bases in 
140 min [95]. Recently, separations of sequencing
fragments over 550 bases in length in about 2 h have
been reported [96]. 
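
To compare these results on a common scale, the read lengths can be converted into bases read per hour. The sketch below is illustrative only. It uses the figures quoted above, and for the first entry it pairs the lower read length (300 bases) with the longer run time (60 min), which is our conservative assumption because the text gives only ranges.

```python
def bases_per_hour(bases, minutes):
    """Read rate of a single separation in bases per hour."""
    return bases * 60.0 / minutes

# (label, bases read, run time in minutes)
runs = [
    ("300 bases in 60 min (assumed pairing)", 300, 60),
    ("450 bases in 140 min [95]", 450, 140),
    ("550 bases in about 2 h [96]", 550, 120),
]

for label, bases, minutes in runs:
    print(f"{label}: {bases_per_hour(bases, minutes):.0f} bases/h")
```

On these numbers the longest reads are not the fastest per hour, which fits the observation that high fields were probably chosen for speed rather than for long readings.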


23.4 Final notes 


The theoretical considerations described in this 
chapter concerning readability are only valid for 
an ‘ideal’ sample. It has to be pointed out that the 
quality of the sample, which is dependent on 
different factors like the DNA preparation and the 
sequencing chemistry used, plays an important role. 
In capillary electrophoresis with electrokinetic in- 
jection for example, the sample must be desalted. 
The sequence of the sample itself (GC content, self- 
complementary sequences, stretches of the same 
nucleotide) influences the readability. The existing 
automated sequencers cannot cope with large differ- 
ences in band intensities and are therefore more 
sensitive to the type of chemistry used [97]. (How- 
ever, this problem has been partially circumvented 
recently, with the introduction of new polymerases 
with a more uniform incorporation of nucleotides 
[98]). Finally, the software used for detection and 
‘band calling’ in film readers and automated se- 
quencers might produce errors. 

Scientific and commercial publications often 
claim readings of several hundred bases, mainly 
obtained with M13 DNA, but the average reading 
length in real sequencing projects can be much lower 
[1-3,99]. 


Protocol 118


How to set up and run a standard sequencing gel 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III. 


This protocol describes the preparation and running of a ‘standard’ 
sequencing gel, using a buffer gradient. It is, of course, a compromise 
between the length of the reading which can be achieved, the 
‘robustness’ of the method and the ease of handling. Good and detailed 
descriptions of how to prepare and run a sequencing gel are also given 
in refs 44 and 100. 

For the other techniques described in this chapter, see the instructions 
of the suppliers or the publications given in the references. 
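
The two gel solutions listed under Materials for this protocol are dilutions of the 40% acrylamide stock and the 10x TBE buffer, so their nominal concentrations can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are ours, and the molar mass of urea (60.06 g/mol) is a value assumed here, not one given in the protocol.

```python
UREA_G_PER_MOL = 60.06  # assumed molar mass of urea

def diluted(stock_conc, stock_ml, final_ml):
    """Concentration after diluting stock_ml of stock to final_ml (C1*V1 = C2*V2)."""
    return stock_conc * stock_ml / final_ml

def molarity(grams, g_per_mol, final_ml):
    """Molar concentration of a solid dissolved and made up to final_ml."""
    return grams / g_per_mol / (final_ml / 1000.0)

# (name, ml of 40% acrylamide, ml of 10x TBE, g of urea, final volume in ml)
recipes = [
    ("0.5x TBE 6% gel solution", 150.0, 50.0, 460.0, 1000.0),
    ("5x TBE 6% gel solution", 37.5, 125.0, 115.0, 250.0),
]

for name, acryl_ml, tbe_ml, urea_g, final_ml in recipes:
    print(name + ":",
          f"acrylamide {diluted(40, acryl_ml, final_ml):.1f}%,",
          f"TBE {diluted(10, tbe_ml, final_ml):.1f}x,",
          f"urea {molarity(urea_g, UREA_G_PER_MOL, final_ml):.2f} M")
```

Both solutions come out at the same acrylamide and urea concentration and differ only in buffer strength, which is what a buffer gradient gel requires.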


Materials 


* 10x TBE buffer (pH 8.8): 162 g Tris, 27.5 g boric acid, 9.2 g Na₂EDTA;
make up to 1 litre with H₂O

* 40% acrylamide solution: 380 g acrylamide, 20 g bis-acrylamide; make
up to 1 litre with H₂O and dissolve. Add 20 g mixed bed resin (e.g.
Amberlite MB1 or equivalent). Stir carefully for 20 min (this step
removes metal ions and acrylic acid). Store in the dark at 4°C. Always
wear gloves and work in the hood

* 0.5x TBE 6% gel solution: 460 g urea, 150 ml 40% acrylamide solution,
50 ml 10x TBE buffer; make up to 1 litre with H₂O and dissolve. Filter
through sintered glass funnel and store in the dark at 4°C for up to
several weeks

* 5x TBE 6% gel solution: 115 g urea, 37.5 ml 40% acrylamide
solution, 125 ml 10x TBE buffer, 10 mg Bromophenol blue (optional);
make up to 250 ml with H₂O and dissolve. Filter through sintered glass
funnel and store in the dark at 4°C for up to several weeks

* 25% ammonium persulphate: make up 2.5 g ammonium persulphate
to 10 ml with H₂O and store for up to several weeks at 4°C

* TEMED (N,N,N′,N′-tetramethyl-1,2-diaminoethane)

* repel-silane (dimethylchlorosilane solution)


Method 


1 Thoroughly wash a set of glass plates with detergent and warm 
water. Rinse with deionized water and let them air-dry. 


2 Treat one glass plate (e.g. the notched one) with repel-silane by
spreading 1 ml of solution with a paper tissue all over the inner
surface (work in the fume hood).


3 Wipe the inner surface of both plates with a few millilitres of 
ethanol using a paper tissue.


4 Put the spacers (0.5-1 cm wide) onto the edges of one plate and
assemble both plates. Form a liquid- and air-tight seal on both sides
and at the bottom with polyester tape. Take particular care at the
bottom corners of the plates.


5 For a 20 × 50 × 0.04 cm gel use 45 ml 0.5x TBE gel solution and 7 ml
5x TBE gel solution. Add to both solutions 2 µl ammonium
persulphate and 2 µl TEMED for each ml of gel solution. Mix the
solutions by swirling.


6 Immediately, take up about 39 ml of the 0.5x TBE mix into a 50-ml
syringe and put it aside. Take up the rest into a 25-ml glass pipette
fitted with a pipette controller.


7 Draw the 5x TBE solution into the same pipette and allow a few air
bubbles to pass upwards through the gel solutions in order to
establish a rough gradient.


8 Slowly pour the solution into the mould (held at an angle), either
down one side (easier to perform) or down the centre of the mould
(gives a more even gradient).


9 Lower the mould, pick up the syringe containing the 0.5x TBE gel
mix and continue pouring. Control the flow rate by altering the
angle at which the mould is held.


10 Examine the gel for air bubbles. Often, air bubbles can be driven
out of the gel by lifting the plates and slightly knocking against the
mould. Alternatively, a thin spacer can be introduced to push a
bubble aside.


11 Lower the mould to a nearly horizontal position and insert the
comb. Clamp the plates together over the side spacers and leave the
gel to polymerize. If some solution remains in the syringe, pour it into a
reaction tube and close it. This way the polymerization can be
monitored. Gelation should take place after about
15-20 min, but polymerization continues for a much longer time.


12 When polymerization is completed, wash away any dried gel from
the outside of the plates and carefully remove the slot former
(works best under an overlay of water).


13 Remove the tape from the bottom and attach the gel to the
electrophoresis apparatus. Fill the buffer chambers with 1x TBE.


14 Denature the samples for 20 min at 80°C. Thoroughly flush the
wells and immediately load 1-2 µl of the sequencing reactions (see
Section 23.2.1.5).


15 When the gel is loaded, close the apparatus and connect it to the
power supply (see Section 23.2.1.6 for field conditions).


16 At the end of the run (see Section 23.2.1.8), disconnect the power
supply. Discard the buffer (radioactive!) and remove the gel from
the apparatus. Peel off the tape and separate the glass plates, using
a spatula. The gel should stick to the nonsilanized plate.


17 Transfer the gel (on the plate) into a 10% acetic acid solution and
leave it for 15 min.


18 Carefully remove the glass plate with the gel from the acetic acid
and let the liquid drain off. Let it dry in a nearly vertical position for
15 min.


19 Lay the glass plate and the gel in a horizontal position, cut the gel
to size with a ‘pizza cutter’ and place a piece of Whatman 3MM
paper on top of it. Peel off the paper with the gel stuck to it.


20 Put a sheet of Saran plastic wrap on top of the gel, trim the whole
and dry the gel on a gel dryer.


21 After drying, peel away the Saran wrap, place the gel into a film
cassette and position a sheet of X-ray film in direct contact with it.
Expose overnight and develop the film.


Protocol 119


How to set up and run a direct blotting gel


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III. 


Running a direct blotting gel requires some extra care and experience. 
One point that needs special attention is the lower surface of the gel. 
The gel must be uniformly polymerized, the glass plates must have a flat 
and smooth surface over the whole width of the gel, and they must be 
carefully aligned. Air bubbles between the gel and the membrane must 
be absolutely avoided. We therefore refer the reader to detailed 
protocols [42,43] and the instructions of the supplier [101]. 


Method 


1 Thoroughly wash a set of glass plates with detergent and warm 
water. Rinse in deionized water and let them air-dry. 


2 Treat both glass plates with bind-silane by spreading 1 ml of
solution with a paper tissue all over the inner surfaces. Let stand for
10-15 min.


3 Wipe the inner surfaces of both plates with a few millilitres of 
ethanol using a paper tissue. 


4 Put the spacers onto the edges of one plate but do not assemble 
both plates. 


5 Prepare 4% gel solution by adding 43.3 ml urea diluent (84g urea, 
18.7 g 10x TBE, 73.4g water) to 6.7 ml 30% acrylamide stock 
solution (for other concentrations, adjust volumes accordingly). 


6 For a 32-cm long gel, take 17 ml degassed gel solution and add 75 µl
10% APS and 17 µl TEMED. Mix the solutions by swirling.


7 Immediately, take the solution up into a 20-ml syringe and pour the
gel by either the sliding technique [101] or the clapping technique [43].


8 Examine the gel for air bubbles and ensure parallel alignment of
the lower glass plate edges.


9 Insert the comb. Clamp the plates together over the side spacers
and leave the gel to polymerize. If some solution remains in the syringe,
pour it into a reaction tube and close it. This way the polymerization
can be monitored. Gelation should take place after about
15-20 min, but polymerization continues for a much longer time.


10 Fill the lower buffer chamber with 1x TBE and attach the
membrane to the conveyor belt.


11 When polymerization is completed, wash away any dried gel from
the outside of the plates and carefully remove dried gel from the
lower edges without touching the gel itself. Remove the clamps and
carefully put the plates into the apparatus. Attach an aluminium
plate to the front glass plate with two clamps.


12 Fill 1x TBE into the upper buffer chamber and remove the comb (or
precomb, respectively).


13 Perform pre-electrophoresis for 30 min. For exact conditions see refs
42 and 101.


14 Switch off and disconnect the electrodes. Flush the wells and load
the samples. (If a sharkstooth comb is used, flush the gel pocket,
carefully introduce the comb and load the samples.)


15 When the gel is loaded, close the apparatus and reconnect it to the
power supply (see Section 23.2.1.9 for field conditions).


16 When the bromophenol blue front is close to the bottom edge,
switch off the power supply and move the membrane under the
gel. Set the speed of the conveyor belt to 8-20 cm h⁻¹. Switch on the
power again and continue the run.


17 At the end of the run, disconnect the power supply and remove the
gel from the apparatus. Carefully hold the membrane with
tweezers and detach it from the conveyor belt.


18 Dry the membrane, crosslink with UV light and expose to film
(when radioactive label was used) or ‘develop’ with the non-
radioactive detection system.


Protocol 120


How to set up and run a sequencing gel for
automated sequencing


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Preparing a gel for automated sequencing requires more care than
for standard gels, to ensure absence of fluorescent contaminants and
to exploit fully the theoretically possible longer read lengths. For more
details see the instructions of the suppliers (e.g. refs 102 and 103).


Method 


1 Thoroughly wash a set of glass plates with detergent
(nonfluorescent!) and warm water. Rinse with deionized water and
let them air-dry.


2 Wipe the inner surface of both plates with a few millilitres of
ethanol or methanol using a paper tissue.


3 Assemble the glass plates according to the instructions of the
supplier. (Alternatively, use a ‘clapping’ technique.)


4 To prepare a 25 × 48 × 0.02 cm gel: to 28.8 g urea add 8 ml 40%
acrylamide solution, 35 ml distilled water and 1 g mixed-bed ion
exchange resin. (This will result in a 4% gel. If a gel of different
concentration, e.g. 5% or 6%, is needed, adjust the volumes
accordingly.)


5 Stir until the urea is dissolved and filter through a 0.22 µm pore size
filter.


6 Transfer into a cylinder, add 8 ml 10x TBE and pure water up to
80 ml.


7 Degas for a few minutes.


8 Add 400 µl of a freshly made 10% APS solution and 55 µl TEMED.
Swirl gently and take up into a syringe.


9 Introduce the gel solution into the assembled glass plates (or use
the clapping technique).


10 Examine the gel for air bubbles. Often, air bubbles can be driven
out of the gel by lifting the plates and slightly knocking against the
mould. Alternatively, a thin spacer can be introduced to push a
bubble aside.


11 Lower the mould to a nearly horizontal position and insert the
comb or precomb. Clamp the plates together over the side spacers
and leave the gel to polymerize. If some solution remains in the syringe,
pour a small quantity into a reaction tube and close it. This way the
polymerization can be monitored. Gelation should take place
after 15-20 min, but polymerization continues for a much longer
time.


12 When polymerization is completed (1-2 h), wash away any dried gel 
from the outside of the plates and carefully remove the slot former. 
(When using a sharkstooth comb, remove the precomb and replace 
with the sharkstooth comb.) 


13 Clean the plates and mount the gel into the electrophoresis 
apparatus according to the instructions of the supplier. Fill the 
buffer chambers with 1x TBE. 


14 Attach the heat transfer plate (if available) and connect the 
electrode cables. 


15 Perform a prerun for about 20min. 


16 Pause the prerun, load the samples, and start the run. 
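
Step 4 asks the reader to adjust the volumes for gels other than 4%. A minimal sketch of that adjustment follows, assuming the final volume stays at 80 ml and only the volume of the 40% acrylamide stock changes, with the water added in step 6 making up the difference. The function name and default arguments are ours.

```python
def acrylamide_stock_ml(target_percent, final_ml=80.0, stock_percent=40.0):
    """Volume of acrylamide stock needed for the target gel concentration."""
    return target_percent * final_ml / stock_percent

for pct in (4, 5, 6):
    print(f"{pct}% gel: {acrylamide_stock_ml(pct):.0f} ml of 40% stock in 80 ml")
```

For the 4% gel this reproduces the 8 ml given in step 4.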


Protocol 121


How to set up and run a capillary electrophoresis gel 
for sequencing 


For details of solutions, media and materials, see Appendix I. For
suppliers and contact addresses see Appendix III.


Sequencing with capillary electrophoresis requires only a few prep- 
aration steps as the whole procedure is fully automated. For more 
details see the instructions of the supplier [95]. 


Method 


1 Equilibrate the plastic syringe containing the separation matrix to 
room temperature and install into the pump block. 


2 Install the sequencing capillary according to the instructions of the 
supplier. 


3 Install the glass syringe (serving as a reservoir for the gel). 


4 Push the gel out of the plastic syringe into the glass syringe, either 
manually or with the help of the machine. Be careful not to introduce 
air bubbles. 


5 Fill the vials with buffer and water according to the instructions. 


6 Prepare the samples, cap the sample vials with septa and load the 
autosampler. 


7 For running parameters (injection time and voltage, run time,
voltage and temperature) use programmed values or adjust 
according to your needs. 


8 Start the run. 


The samples are then automatically injected by electrokinetic 
injection and separated in the gel-filled capillary. After each run, a small 
amount of gel is automatically pushed out of the glass syringe into the 
capillary, replacing the used gel. 


24.1 Introduction


The goal of this chapter is to give a newcomer to the 
field of DNA sequencing an overview of labelling 
and detection methods used. In addition, guidelines 
are provided to help in the choice of a method for a 
sequencing project, along with example protocols. 

For at least 10 years after the original description 
of DNA sequencing by the dideoxy sequencing 
method [1] and the chemical sequencing method [2] 
(see Chapter 22), virtually all sequencing was done 
using radioactive labelling and autoradiographic 
detection. Today, however, a scientist has the choice 
between a number of different labelling-detection 
methods for DNA sequencing. The most commonly 
used methods can be grouped into three categories: 
1 radioactive isotope methods;
2 fluorescence-based machines [3-6];
3 enzyme-linked methods using colorimetric,
chemiluminescent and fluorigenic substrates.

Two other methods to detect DNA sequence
patterns are less commonly used: hybridization [7,8]
and silver staining [9,10].

A typical band in a DNA sequence pattern consists
of 0.01-0.5 femtomoles (fmol) of DNA; for com- 
parison, a typical band in an agarose gel contains 
10-200 fmol DNA. To detect these minute amounts 
of DNA, some kind of signal amplification mech- 
anism is used by all detection methods. With 
radioactive labelling and detection, the amplification
takes place when a few decay events lead to a
chemical chain reaction in the film which produces a 
visible silver grain. With fluorescence detection, one 
takes advantage of the fact that each fluorescent dye 
molecule can be excited and then emit light many 
thousands of times. 
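
To give a feeling for these quantities, the amounts quoted above can be converted into numbers of molecules. The sketch below is illustrative; the function name is ours and the Avogadro constant is the standard value, not a figure from this chapter.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def molecules_from_fmol(fmol):
    """Number of molecules in the given amount of substance (femtomoles)."""
    return fmol * 1e-15 * AVOGADRO

bands = [
    ("sequencing band", 0.01, 0.5),
    ("agarose gel band", 10, 200),
]

for label, low, high in bands:
    print(f"{label}: {molecules_from_fmol(low):.1e} to "
          f"{molecules_from_fmol(high):.1e} molecules")
```

A sequencing band thus holds roughly 10⁷ to 10⁸ molecules, between two and four orders of magnitude fewer than a typical agarose gel band.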

Finally, enzyme-linked methods utilize the high 
substrate turnover of the enzymes used to obtain a 
signal amplification of several orders of magnitude, 
thereby creating a signal which can be seen by 
the naked eye. Enzyme-linked methods typically 
employ labelling molecules such as biotin [11] or 
digoxigenin [12]. These are detected through bridge 
molecules such as streptavidin or antibodies that are 
labelled with enzymes such as alkaline phosphatase 
or horseradish peroxidase. Upon incubation with 
colorimetric, chemiluminescent, or fluorigenic sub- 
strates, the product molecules of the enzymatic 
reaction give rise to a coloured precipitate, emit 
light, or become fluorescent. 

A variety of different combinations of labels, 
enzymes and substrates have been successfully used 
for DNA sequencing with enzyme-linked detection; 
similarly, a variety of different isotopes and fluorescent
sequencers are available for radioactive and
fluorescent sequencing. This chapter gives an overview
of the most commonly used enzyme-linked
and radioactive methods. Fluorescent methods, on 
the other hand, will be treated only briefly, mainly 
because the initial cost of fluorescence-based 
sequencers will rule out their use for many small 
laboratories. 

In addition, hybridization-based detection meth- 
ods will be discussed. The concept was originally 
developed for methylation studies [7] and later 
extended to multiplex sequencing [8]. Radioactive 
hybridization probes [8] as well as enzyme-linked 
[13-15] and fluorescent [16] detection schemes have 
been used for multiplex sequencing (see Chapter 
20). This section may be of interest even if the reader 
is not considering multiplex sequencing as an 
option, since the use of enzyme-labelled hybridiza- 
tion probes can be a convenient alternative to labels 
like biotin or digoxigenin. 


24.2 Choosing a detection method 


The choice of the labelling and detection method 
depends on a number of factors. These include: 

* local experience and equipment;
* personal preferences;
* health concerns;
* time constraints;
* project size;
* regulatory restrictions;
* reagent and material costs;
* sequencing strategy.

The relative importance of these factors will vary 
from case to case, and individual factors will often 
make the choice obvious. To give an example, results 
might be needed very quickly in a laboratory where
only experience with radioactive sequencing exists, 
making radioactive labelling the method of choice; 
or, as is the case for the author’s laboratory, personal 
preference and experience can tilt the decision the 
other way, towards enzyme-linked methods. 

In less obvious cases, one can give each factor a 
weight and all of the possible methods a score. By 
summing over all of the scores multiplied by the 
weights, a rational choice can be made for the 
method with the highest weighted score. Table 24.1
compares the major advantages and disadvantages 
for radioactive and enzyme-linked detection and 
may be helpful for this task. 
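
The weighted-score procedure described above can be sketched in a few lines. All weights and scores below are hypothetical values chosen for illustration, not recommendations; only the factor names are taken from the list in this section.

```python
def weighted_score(weights, scores):
    """Sum of each factor's score multiplied by that factor's weight."""
    return sum(weights[factor] * scores[factor] for factor in weights)

# Hypothetical weights (importance of each factor for this laboratory)
weights = {
    "local experience and equipment": 3,
    "health concerns": 2,
    "time constraints": 2,
    "reagent and material costs": 1,
}

# Hypothetical scores from 1 (poor) to 5 (good) for each method
methods = {
    "radioactive": {
        "local experience and equipment": 5,
        "health concerns": 1,
        "time constraints": 2,
        "reagent and material costs": 3,
    },
    "enzyme-linked": {
        "local experience and equipment": 2,
        "health concerns": 5,
        "time constraints": 4,
        "reagent and material costs": 3,
    },
}

totals = {name: weighted_score(weights, scores) for name, scores in methods.items()}
best = max(totals, key=totals.get)
print(totals, "->", best)
```

With a different set of weights, for instance a heavier weight on local experience, the same scores can favour the other method, which is the point of making the weights explicit.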

Unless regulatory or monetary restrictions dictate
the choice of a detection system, personal prefer- 
ences will often play a major role. For best results 
with enzyme-linked detection, good protocols and 
the help from experienced users can be of critical 
importance. The single most determining factor of 
success, however, is often how careful and meticulous
the person performing the experiments is.


Table 24.1 Comparison of radioactive and enzyme-linked detection methods for DNA sequencing.

Radioactive: advantages

* Established in most laboratories
* Reagent costs can be lower than for enzyme-linked detection
* No transfer to membranes required
* Most straightforward method for ‘primer-walking’ strategies

Radioactive: disadvantages

* Potential radiation hazards
* Training in handling of radioisotopes required
* Unstable reagents and reactions
* Long exposure times, especially with ³³P and ³⁵S
* Usage may be restricted by local and state regulations

Enzyme-linked: advantages

* Stable reagents and reactions (can be stored for months)
* No training for use of radioisotopes required
* No special licences or work areas for radioactivity needed
* Multiple exposures can be obtained within 1-3 h
* Results can be obtained within 2 h of electrophoresis
* Easy switch to label multiplexing for higher efficiency
* Very high spatial resolution [29] and multicolour detection [64] with colorimetric substrates
* Well suited for users of direct transfer electrophoresis [19,20,29]

Enzyme-linked: disadvantages

* Transfer from gels onto membranes required
* Reagent costs can be higher, especially with chemiluminescent detection
* Background problems can result from bacterial contamination of buffers and handling errors
* Additional ‘hands on’ time (0.5-2 h) required for detection procedures


24.2.1 Cost considerations


In many cases, the higher cost of nonradioactive
detection kits is viewed as a strong argument for
radioactive detection. For typical chemiluminescent
detection kits, costs for membranes, buffers,
antibody-enzyme complexes, substrates, and films
come to approximately $40-$50 per membrane
(15 × 40 cm). With 10 clones and an average read
length of 200-250 bases, this amounts to 2 cents per
raw base. These costs can easily be reduced by using
lower concentrations of chemiluminescent substrates,
or by using colorimetric or fluorigenic
substrates.

Cost for radioisotopes, on the other hand, can be
as high or higher. The isotope ³³P, which is often used
because of its lower radiation hazards and easier
handling [17], for example, has a current list price of
$7 per reaction. For a gel with 10 clones, the resulting
isotope costs of $70 would be higher than for
enzyme-linked detection with chemiluminescent
substrates. For occasional users of radioactivity who
do not use an entire vial before decay, costs can be
even higher.

However, the cost contribution of labelling and
detection reagents to the overall costs in DNA 
sequencing is generally very small; this is generally 
true for both enzyme-linked and _ radioactive 
detection. For most sequencing projects, the total 
cost is typically between $1 and $3 per finished base 
pair, or between 10 and 50 cents per ‘raw’ base. Thus, 
costs of labelling and detection typically contribute 
less than 5-10% of the overall costs in DNA 
sequencing projects. 


24.2.2 Project size 


Another misconception is that enzyme-linked
detection methods may be appropriate for small
projects, but not for medium-sized or large projects.
Quite the opposite is true: within the last three years,
at least three projects have used different
enzyme-linked approaches to generate more than 600 000
bases of raw sequence data each. In one case, more
than 4 million bases of raw sequence were generated
for 500 kb of finished sequence [18]. Two
of these projects were run under severe cost
constraints, illustrating the fact that enzyme-linked
detection can be cost-efficient.


24.2.3 Handling and time considerations 


The fact that DNA needs to be transferred onto
nylon membranes for enzyme-linked detection may
keep some from switching to nonradioactive
systems. However, two different transfer methods can
make this task very straightforward: direct transfer
electrophoresis (DTE) [19,20] and ‘contact’ capillary
transfer (see Section 24.4.9).

In enzyme-linked detection, the transfer step as
well as incubations and washes during development
can make the overall procedure more time-intensive
than radioactive detection. However, this
can be offset by convenience gains. Reactions can be
done on any bench without special safety
precautions, no radioisotopes need to be ordered and
disposed of, and reagents and reactions can be stored
without any impact on the quality of results, thus
allowing for more flexible work schedules.


24.3 Radioactive labelling 


24.3.1 Isotope choices 


Classically, radioactive DNA sequencing has been
done with ³²P-labelled deoxynucleotides. ³²P has a
half-life of 14 days; since it is a strong β-emitter
(maximum at 1.71 MeV [21]), it necessitates the use
of plastic and/or lead shields to minimize health
hazards. In addition, the high-energy radiation
tends to increase the width of bands on
autoradiographs.

³⁵S-labelled nucleotides [22] have several advantages
over ³²P. The longer half-life (87 days) and
weaker emission [21] (maximum: 0.16 MeV) give
longer shelf lives, allow for storage of sequencing
reactions for at least one week, and eliminate the
need for shields. Furthermore, the weaker emissions
lead to sharper bands and longer sequence reads. On
the other hand, exposure times are several times
longer than with ³²P, and gels have to be dried before
film exposure. However, fixation in acetic
acid/methanol mixtures and direct contact of gel and film,
as originally described [22], are not essential [17,23].

Often, the isotope of choice [17] in radioactive
sequencing is ³³P rather than ³⁵S or ³²P. The emission
energy of ³³P is about sixfold lower than that of ³²P
(maximum: 0.248 MeV vs. 1.7 MeV), and the half-life
is about twice as long [21] (25 days vs. 14 days). It
therefore offers the same advantages as ³⁵S: no
shielding is required, band patterns are sharp,
reactions can be stored for a week, and the shelf life
is longer. However, unlike ³⁵S, it does not require
drying of gels before exposure, and exposure times
are only about 1.5-3 times as long as with ³²P.
Furthermore, contamination is easier to detect with
³³P than with ³⁵S.


The one major drawback of ³³P is the higher cost:
current prices are three- to sixfold higher than for ³²P,
and they can exceed the cost of enzyme-linked labelling
and detection. Furthermore, new licences for
handling this isotope may be required, and on-site
storage times before radioactive waste can be
disposed of as nonradioactive waste are twice as
long as for ³²P.
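
The practical consequences of the different half-lives can be illustrated with the standard decay law, N(t) = N0 × 0.5^(t / half-life). The sketch below uses the half-lives quoted in this section; the function names are ours, and the figure of 10 half-lives for decay-in-storage is an assumed rule of thumb, not a value taken from the text or from any particular regulation.

```python
# Half-lives in days, as quoted in Section 24.3.1.
HALF_LIFE_DAYS = {"32P": 14.0, "33P": 25.0, "35S": 87.0}

def fraction_remaining(isotope, days):
    """Fraction of the initial activity left after the given number of days."""
    return 0.5 ** (days / HALF_LIFE_DAYS[isotope])

def decay_storage_days(isotope, n_half_lives=10):
    """Storage time for waste, assuming a fixed number of half-lives.

    n_half_lives=10 is an assumption for illustration only.
    """
    return n_half_lives * HALF_LIFE_DAYS[isotope]

for isotope in ("32P", "33P", "35S"):
    left = fraction_remaining(isotope, 7)   # reactions stored for one week
    store = decay_storage_days(isotope)
    print(f"{isotope}: {100 * left:.0f}% activity after 7 days, "
          f"{store:.0f} days of waste storage")
```

Whatever number of half-lives is required locally, the storage time scales with the half-life, so the ratio between ³³P and ³²P is 25/14, or roughly twofold.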


24.3.2 Incorporation vs. end-labelled primers 


Conventional radioactive labelling procedures are
based on the incorporation of labelled nucleotides
into the synthesized DNA strand. Alternatively, the
primer oligonucleotides can be 5’-labelled with
γ-labelled dNTPs; this is most often done when cycle
sequencing protocols are used. End-labelled primers
can often lead to cleaner sequences, for example
when RNA contamination leads to false priming,
which results in increased background.


24.4 Enzyme-linked detection 


24.4.1 Choosing a label 


In DNA sequencing with enzyme-linked detection,
biotin and digoxigenin are currently the most
commonly used labelling molecules. Typically,
oligonucleotide primers are chemically labelled at
the 5’-end during or after synthesis. However, the
use of hapten-labelled ribo- and deoxynucleotides
in enzymatic labelling reactions has also been
described [24-27]. Enzymatic labelling protocols are
especially attractive when the primer is used only in
one or a few sequencing reactions, for example with
primer-walking strategies. For universal primers,
on the other hand, the extra cost and/or time to
end-label oligonucleotides chemically tends to be
negligible since primers are used many times. For
biotin as well as digoxigenin, reagents for chemical
labelling (N-hydroxysuccinimidyl compounds and
phosphoramidites) and enzymatic labelling (ribo-
and deoxynucleotides) are readily available from
several commercial sources. In addition, many
companies that specialize in custom oligonucleotide
synthesis offer biotin- and digoxigenin-modified
oligonucleotides.

In addition to biotin and digoxigenin, fluorescein 
and 2,4-dinitrophenyl (DNP) have also been used 
for enzyme-linked detection of DNA sequence 
patterns [28]. These labels are attractive for label 
multiplexing strategies (see Section 24.4.5 below), 
but their general use may be limited by the 
availability of highly active antibody—enzyme con- 
jugates. 

In our experience, biotin-based detection systems
tend to give shorter exposure times than
digoxigenin-based detection systems. However, cost and
convenience considerations may be more important.
The lowest costs per development can be obtained
by using biotinylated primers, streptavidin,
biotinylated alkaline phosphatase, and colorimetric
detection as described [15,29,30]. For maximum
convenience, however, the use of streptavidin-enzyme
or antibody-enzyme complexes can lead to
protocols with fewer incubation and washing steps.


24.4.2 Biotin: one- and two-component systems 


The first publications on enzyme-linked detection of
DNA sequences used a two-component system:
membranes were first incubated with streptavidin,
followed by incubation with biotin-labelled alkaline
phosphatase. After several washes, the sequence
patterns were detected by incubation with the
colorimetric substrate/enhancer combination
BCIP/NBT [30]. Background problems with the original
procedure were reduced by using high concentrations
of SDS to block nonspecific binding of
streptavidin to nylon membranes [29]. For fast
results, the total time for incubation and wash steps
can be reduced to less than 30 min [31].

Alternatively, one-component systems, that is
preformed complexes of streptavidin and alkaline
phosphatase, can be used. This reduces the number
of incubation and wash steps and therefore the total
hands-on time. The elapsed time, however, tends to
be longer owing to longer incubation times.
Furthermore, 100-fold differences in signal intensity as well
as limited shelf lives have been observed with
different streptavidin-phosphatase complexes (P.R.,
unpublished; C.M. Martin, unpublished). Therefore,
it is advisable to use streptavidin-phosphatase
complexes specifically tested for detection of DNA
sequence patterns. Even for highly active
streptavidin-phosphatase conjugates, sequences within
10-30 bases of the primer tend to be weaker and may
not be readable.

Finally, biotin is sufficiently stable to be used as a 
label in chemical sequencing [32]; this allows the use 
of enzyme-linked detection in chemical sequencing 
as well as in footprinting studies of DNA-protein 
interactions. 


24.4.3 Digoxigenin and other haptens 


Digoxigenin has also been used successfully for
enzyme-linked DNA sequencing [16,18,28].
Development protocols and results are similar to
one-component biotin detection systems; in fact, the
same protocol can be used for digoxigenin detection
with antidigoxigenin alkaline phosphatase-labelled
antibodies and for biotin detection with
streptavidin-phosphatase complexes [18] (see Section
24.4.4). Results are similar, although exposure times
with digoxigenin tend to be longer.

In addition to alkaline phosphatase-labelled 
antibodies, peroxidase-labelled antibodies to digox- 
igenin in conjunction with enhanced chemilumines- 
cence [33,34] have been used for DNA sequencing, 
allowing for simple protocols for ‘label duplexing’ 
[20]. 

For other labels like fluorescein and DNP, 
detection systems are currently not as widely 
available as systems for biotin and digoxigenin. 
Therefore, the use of these haptens will typically be 
restricted to label multiplexing procedures [20,28] 
(see Section 24.4.5). The main advantage of 
fluorescein and other fluorescent haptens is that 
purification of labelled primers by gel electro- 
phoresis is simplified, since the labelled primer can 
easily be seen by eye during electrophoresis. 
However, precautions to minimize exposure to light 
and photo bleaching have to be taken when 
fluorescent haptens are used. 


24.4.4 Enzymes and substrates 


24.4.4.1 Alkaline phosphatase vs. horseradish peroxidase 
The most commonly used enzyme by far for
nonradioactive detection of DNA sequencing
patterns is alkaline phosphatase, typically from calf
intestine (CIP). The high specific activity and the
large number of available substrates make CIP the
enzyme of choice. In addition, the high stability of
CIP enables prolonged signal developments of up to
several days.

Peroxidase from horseradish, another enzyme
which has been used for DNA sequencing, is less
stable in the presence of substrates. Signal intensities
with peroxidase-based chemiluminescent detection
decrease rapidly after 1 h [34]. Therefore, this system
is less well suited when signal intensities are very
low, or when multiple exposures are desired.
Furthermore, film exposures have to be taken as soon
as possible after addition of peroxidase substrates,
thus making peroxidase-based protocols less
convenient and flexible than alkaline
phosphatase-based protocols.


24.4.4.2 Colorimetric substrates 

Use of colorimetric substrates for DNA sequencing
with end-labelled primers has only been reported
for alkaline phosphatase [30], not for peroxidase.
The commonly used substrate is BCIP
(5-bromo-4-chloro-3-indolyl phosphate; also called X-Phos) with
the enhancer NBT (nitro blue tetrazolium). One
advantage of colorimetric detection is the low
substrate cost (about $1 per membrane, compared to
$10-$20 for chemiluminescent substrates).
Furthermore, colorimetric detection gives very high
resolution of sequencing band patterns [29].
Colorimetric detection is well suited for manual sequence
reading with a digitizer tablet, or for scanning with
flat-bed scanners, which are less costly than film
scanners ($800-$1500 as opposed to
$10 000-$30 000).

For long-term storage of colorimetric sequence
patterns, efficient washing of membranes after the
colour development is essential. Washing with
nonionic detergent solutions for 30 min or longer,
followed by rinsing with water, removes excess
substrate and enhancer and allows storage for
several years with minimal degradation of the
pattern quality [16].

Compared with chemiluminescent substrates,
colorimetric detection has three limitations.

1 The detection step takes longer, from 1 h to
overnight.

2 Multiple exposures cannot be obtained.

3 Removal of precipitated product for reprobing in
multiplex experiments requires washes with hot
dimethylformamide, which is highly toxic [35].

For typical in-house sequencing projects, how- 
ever, limitations 2 and 3 will not matter, and the 
longer signal development times will be tolerable 
under most circumstances. 


24.4.4.3 Chemiluminescent substrates 
Chemiluminescent enzyme substrates emit light
after enzymatic modification (therefore, the more
correct term would be ‘chemiluminogenic
substrates’). The use of chemiluminescent enzyme
substrates avoids the three limitations of
colorimetric substrates mentioned above: exposure times
are short, typically from several minutes to 1 h;
multiple exposures can be obtained within one or a
few hours; and the substrate as well as the product
of the enzymatic reaction can easily be removed
from the membrane to allow successive
developments. Furthermore, researchers used to radioactive
sequencing may be more comfortable with
obtaining results in a familiar form, on X-ray film.

On the down side, costs for chemiluminescent
substrates tend to be significantly higher than for
colorimetric substrates. However, these costs can
easily be reduced, at least for alkaline
phosphatase-based detection, by using lower substrate
concentrations. Suggested concentrations in most kits are
close to the Km of the enzyme, and they can typically
be lowered by a factor of 2-10. Exposure times will
be longer by a similar factor, but this may often be
acceptable.
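
Why the exposure time grows with the dilution, although somewhat less than proportionally, can be seen from simple Michaelis-Menten kinetics. The sketch below is an illustration under stated assumptions, not a measured result: it assumes that the kit concentration equals Km exactly, that the enzyme follows v = Vmax × [S] / (Km + [S]), and that the required exposure time is inversely proportional to the reaction rate. The function names are ours.

```python
def relative_rate(substrate_over_km):
    """Michaelis-Menten rate as a fraction of Vmax, with [S] given in units of Km."""
    return substrate_over_km / (1.0 + substrate_over_km)

def exposure_factor(dilution, start_over_km=1.0):
    """Factor by which exposure time grows when the substrate is diluted.

    Assumes exposure time is inversely proportional to the enzymatic rate.
    """
    return relative_rate(start_over_km) / relative_rate(start_over_km / dilution)

for dilution in (2, 5, 10):
    print(f"{dilution}-fold less substrate: "
          f"exposure about {exposure_factor(dilution):.1f} times longer")
```

Starting from [S] = Km, an n-fold dilution lengthens the exposure by (n + 1)/2 in this model, that is 1.5-fold for a 2-fold dilution and 5.5-fold for a 10-fold dilution.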

One aspect of chemiluminescent detection which
can be of interest for high-throughput sequencing
projects is that the sequence patterns can be detected
directly by light-sensitive cameras [36]. However,
exposure times range from several minutes to 30 min,
necessitating the use of rather expensive equipment
for direct detection.


Chemiluminescent substrates for alkaline phosphatase 
1,2-dioxetane-based substrates for alkaline phos- 
phatase [37,38] are currently available from many 
distributors under names such as AMPPD, CPD, 
and CPD-Star. The substrates are stable in alkaline 
solution, but become unstable upon dephosphory- 
lation, and eventually break apart, emitting light 
with an emission maximum at 460 nm. In addition, 
the dephosphorylation leads to binding of the 
product molecule to nylon membranes. This binding 
to nylon membranes increases the product half-life 
to several hours, compared to minutes in aqueous 
solution [14,39]. 

The long half-life of the product molecule leads to 
accumulation of product and to an increase in signal 
strength over time. As a result, exposures taken after 
several hours’ preincubation are much darker than 
exposures taken shortly after substrate addition. 
Product accumulation can also lead to band 
broadening, most noticeably when exposures are 
taken the next day. 

Two improved dioxetane substrates for alkaline 
phosphatase, named CSPD [40] and CPD-Star, have 
been developed. The dephosphorylated product of 
CSPD shows a reduced half-life of 40 min on nylon 
membranes, leading to faster exposures and 
reduced band broadening effects [40]. An example 
of a sequencing pattern detected with CSPD is 
shown in Fig. 24.1. 
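
The effect of the product half-life on signal build-up can be made concrete with a deliberately simple model. The sketch below assumes that the enzyme generates dephosphorylated product at a constant rate and that the product decays, emitting light, with first-order kinetics; under these assumptions the light output approaches a plateau as 1 − 0.5^(t / half-life). Neither the model nor the function names come from the cited work, and the 4 h half-life used for comparison is an assumed stand-in for ‘several hours’.

```python
import math

def plateau_fraction(minutes, half_life_min):
    """Light output as a fraction of its plateau value after a given time."""
    return 1.0 - 0.5 ** (minutes / half_life_min)

def time_to_fraction(fraction, half_life_min):
    """Minutes needed to reach the given fraction of the plateau."""
    return -half_life_min * math.log2(1.0 - fraction)

CSPD_HALF_LIFE = 40.0      # minutes on nylon, as quoted in the text
SLOW_HALF_LIFE = 240.0     # assumed example value for 'several hours'

for name, half_life in (("CSPD", CSPD_HALF_LIFE), ("slow product", SLOW_HALF_LIFE)):
    t90 = time_to_fraction(0.9, half_life)
    print(f"{name}: 90% of plateau after about {t90:.0f} min")
```

In this model a shorter product half-life means the signal reaches its plateau sooner, which is consistent with the faster exposures reported for CSPD.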

The newest substrate, CPD-Star, offers signal
intensities that are ~10-fold higher than the
intensities from CSPD, making it the ideal choice for
direct detection with light-sensitive cameras. For
film-based detection, exposure times of less than
1 min can easily be achieved with CPD-Star;
alternatively, much lower substrate concentrations
can be used for exposure times comparable to those
typical for CSPD or AMPPD.


Enhanced chemiluminescence for horseradish peroxidase 
Chemiluminescent substrates for peroxidase show
quite different characteristics from phosphatase
substrates. Light emission is maximal shortly after
addition of the substrate and decays rapidly [34];
after an overnight incubation, hardly any signal
intensity is left. This poses some restrictions on the
timing and number of exposures; more importantly, it
limits sensitivity. While the sensitivity of
phosphatase-based detection is typically limited by
nonspecific binding of enzymes and other components
to the nylon membrane, peroxidase-based detection
is often limited by the total signal intensity.
However, the signal levels are high enough for
typical sequencing experiments.

Fig. 24.1 Enzyme-linked detection of DNA sequencing
patterns. Sequencing was done from recombinant M13
clones with 5’-biotin-labelled primer, modified T7 DNA
polymerase and manganese buffers. Reactions were
separated on 6% acrylamide wedge gels and transferred
onto Biodyne A membranes. Patterns were detected with
streptavidin, biotinylated alkaline phosphatase, and
CSPD as described [20]. Exposure time, 1 h.

At least theoretically, peroxidase-based detection
can be less contamination-sensitive than
phosphatase-based detection. Alkaline phosphatase is
ubiquitous, and buffer contamination by bacteria,
for example, can lead to very high background
levels when phosphatase-based detection is used. To
avoid such problems, many protocols suggest the
use of bacteriostatic stock solutions and/or
preparation of buffers shortly before use.


Exposure considerations. Exposure to X-ray film can
be done directly in hybridization bags, which may
have been used for the preceding incubations and
washes, as suggested by several distributors of
detection kits. However, most hybridization bags
are thicker than Saran wrap (25-50 µm vs. 11 µm),
and band sharpness suffers somewhat. This is
especially noticeable in regions where bands are
closer together, towards the top of the gel. When
maximum read lengths are desired, it is preferable to
wrap membranes in Saran wrap or very thin Mylar
sheets (optionally on top of a thicker plastic
backing). In addition, weights on top of exposure
holders should be used to ensure good contact
between membrane and film.


24.4.4.4 Fluorigenic substrates

Besides colorimetric enzyme substrates, fluorigenic 
substrates have been used for a long time in enzyme- 
linked detection methods [41]. Ideal fluorigenic 
substrates show no (or minimal) fluorescence; upon 
enzymatic modification, for example dephosphory- 
lation by alkaline phosphatase, the product mole- 
cules show strong fluorescence. 

A new fluorigenic substrate for alkaline phos- 
phatase, called Atto-Phos, has been introduced 
recently [42] and shows this desirable characteristic. 
Furthermore, the product of the enzymatic reaction 
binds strongly to nylon membranes, thus enabling 
the detection of sequencing patterns [15]. Because 
of the enzymatic amplification, the fluorescence 
intensity is so high that patterns can be seen simply 
by illumination with hand-held UV lamps. Very 
strong sequence patterns can even be seen in normal 
daylight without special illumination. 

While fluorigenic substrates are considerably less
expensive than chemiluminescent substrates, their
general use for DNA sequencing is currently limited
because permanent records cannot easily be
obtained. Results can be documented by
photography under illumination with long-wave UV, but
only small sections of membranes can be imaged at
sufficient resolution. Alternatively, detection can be
done with cooled charge-coupled device (CCD)
cameras; exposure times for Atto-Phos range from
one to several seconds. However, only small sections
can be imaged at sufficient resolution with CCD
cameras, unless very expensive cameras (2000 × 2000
pixels or more) are used (see Chapter 13).
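
The camera limitation is largely a matter of pixel counts. The sketch below estimates how many separate camera fields are needed along the long axis of a membrane; the 40 cm length is the membrane size mentioned in Section 24.2.1, whereas the sampling density of 5 pixels per mm and the 512- and 1024-pixel sensors are assumptions chosen for illustration. The function names are ours.

```python
def pixels_required(length_mm, pixels_per_mm):
    """Pixels needed to cover a length at the requested sampling density."""
    return length_mm * pixels_per_mm

def fields_required(length_mm, pixels_per_mm, camera_pixels):
    """Number of separate camera fields along one axis (ceiling division)."""
    needed = pixels_required(length_mm, pixels_per_mm)
    return -(-needed // camera_pixels)

MEMBRANE_LENGTH_MM = 400   # 40 cm membrane, as in Section 24.2.1
PIXELS_PER_MM = 5          # assumed sampling density, not from the text

for camera_pixels in (512, 1024, 2000):   # 512 and 1024 are assumed sensor sizes
    n = fields_required(MEMBRANE_LENGTH_MM, PIXELS_PER_MM, camera_pixels)
    print(f"{camera_pixels} pixels: {n} field(s) along the membrane")
```

With these assumed numbers, only the 2000-pixel sensor covers the full membrane length in a single field.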

We have observed two other problems with
fluorigenic substrates. First, the dephosphorylated
product diffuses somewhat at higher signal
intensities. This makes bands appear blurred,
and also limits the dynamic range. Second, the
product is difficult to remove completely for
multiple reprobings, requiring either very long
washes or alkaline stripping buffers. Alkaline
stripping conditions, however, lead to DNA loss and
thus limit the number of successful reprobings.
Therefore, chemiluminescent substrates will be a
better choice than fluorigenic substrates in most
instances.


24.4.5 Label multiplexing 


To increase overall efficiency of enzyme-linked
protocols, combinations of several haptens can be
used instead of single haptens. To give a simple
example, two sets of sequencing reactions, one with
biotin-labelled primer and the other with
digoxigenin-labelled primer, can be done individually and
then combined before electrophoresis. After transfer
to membranes, detection of the biotin- and
digoxigenin-labelled sequences can be done successively
[20]. Compared to using just one label, only half as
many gels have to be run and transferred, giving
significant savings in time. Furthermore, costs for
membranes are halved.

The efficiency gains can be even higher when two 
differently labelled primers are used in the same 
sequencing reaction. This reduces the number of 
sequencing reactions by a factor of two, simultane- 
ously reducing the costs for sequencing reagents. 
This approach is well suited to cycle sequencing 
protocols. With double-stranded templates, sequence 
from both strands can be obtained in the same 
reaction. However, the distance between the primers 
(the insert size) may need to be larger than the 
desired read lengths to obtain optimal results. 

The concept of ‘label multiplexing’ has been 
extended to the use of four labels per membrane 
[28]. This further increases efficiency and reduces 
costs. However, exposure times with chemilumines- 
cent substrates will increase with the number of 
labels used, unless the total volume of the sequen- 
cing reactions is reduced, for example by ethanol 
precipitation. 
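
The savings from label multiplexing follow directly from the number of labels that share one membrane. The sketch below counts gels (and membranes) and successive detections for one, two and four labels; the batch size of 24 reaction sets is an assumed example, and the function names are ours.

```python
import math

def gels_needed(reaction_sets, labels_per_membrane):
    """Gels (and membranes) required when several labelled sets share one gel."""
    return math.ceil(reaction_sets / labels_per_membrane)

def detections_per_membrane(labels_per_membrane):
    """Each label on a membrane is developed in its own successive detection."""
    return labels_per_membrane

REACTION_SETS = 24   # assumed batch size for illustration

for labels in (1, 2, 4):
    print(f"{labels} label(s): {gels_needed(REACTION_SETS, labels)} gels, "
          f"{detections_per_membrane(labels)} detection(s) per membrane")
```

The number of gels and membranes falls in proportion to the number of labels, while the total number of detections stays the same; the gain is in electrophoresis, transfer and membrane costs.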

If the same enzyme is used in subsequent 
detections of different labels, the enzyme from 
previous detections must be removed or irreversibly 
denatured. With alkaline phosphatase, this can 
easily be done either by incubation in low-pH 
buffers at room temperature [28], or by inactivation 
with hot buffers containing SDS-EDTA [20]. For 
horseradish peroxidase-based detection, incubation 
in peroxide-containing substrate solutions may be 
sufficient between subsequent detections, since 
horseradish peroxidase is inactivated by peroxide 
[34]. An even simpler approach is the use of different 
enzymes. Then, incubation with different antibody— 
enzyme or streptavidin-enzyme conjugates can be 
done simultaneously, and only the substrate solu- 
tions need to be changed between film exposures 
[20]. 


24.4.6 Detection with oligonucleotide-enzyme 
conjugates 


The idea of using hybridization with
oligonucleotide-enzyme conjugates to detect sequence
patterns [14] might seem odd. However, the
convenience and efficiency of such an approach are at
least comparable to those of conventional hapten-based
approaches, owing to several factors. First,
stringency requirements are low and probes are short, so
that hybridization and posthybridization washes
can be done at room temperature. Second,
hybridization times can be very short, from 10 to 30 min
with probe concentrations of 0.5-2 nM. And third,
a large number of protocols for the efficient
preparation of oligonucleotide-enzyme conjugates
have been described [43-46]. Directions for the
preparation and use of such conjugates are given in
Protocol 123.

While hybridization-based detection can be used 
for universal primer-based sequencing strategies, it 
can also be adapted for ‘primer walking’ strategies. 
In this case, walking primers with a 5’-tag sequence 
are used, and oligonucleotides complementary to 
the tag sequence are used for detection. The tag 
sequences lead to some additional costs per primer; 
however, these costs are typically lower than costs 
for custom labelling with biotin or digoxigenin, and 
are likely to go down further in the future. 


24.4.7 General procedures for 
enzyme-linked detection 


24.4.7.1 Labelling of oligonucleotides
To introduce labels like biotin or fluorescein into
oligonucleotide primers for DNA sequencing,
chemical as well as enzymatic methods can be used.
Chemical labelling is typically used for universal
primers which are used many times, while
enzymatic labelling is convenient if walking primers
or existing, unmodified primers are to be used.
Long spacer arms between the hapten and the
DNA improve signal intensities [47,48]. Longer
spacers maximize the accessibility of the hapten to
streptavidin or antibodies. Therefore, compounds with
the longest possible spacer arm should be chosen
when different choices are available.


24.4.7.2 Chemical end-labelling 

The most convenient labelling method is the use of 
hapten-labelled phosphoramidites during chemical 
synthesis of oligonucleotides. Phosphoramidites 
labelled with biotin [49,50], digoxigenin, and fluo- 
rescein have been described. 

Labelling during synthesis, however, can be
expensive. Occasionally, we have also observed
poor efficiencies, most likely due to prolonged
storage and use of modified phosphoramidites.
Therefore, postsynthesis labelling of
amino-modified oligonucleotides by reaction with
amino-reactive haptens is often preferred. Most commonly,
oligonucleotides are synthesized with a 5’-amino
group on a C6-spacer and reacted with a 10-50-fold
molar excess of N-hydroxysuccinimidyl (NHS)- or
isothiocyanate (ITC)-modified haptens.
Modification efficiencies are typically between 40 and 90%.
Protocol 122 describes labelling with
NHS-LC-biotin. However, it can be used without changes for
labelling with NHS-digoxigenin,
fluorescein-ITC, ITC-infrared dyes, and other NHS- or
ITC-modified haptens.
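
The reagent amounts implied by these figures are easily worked out. The sketch below takes the 10-50-fold molar excess and the 40-90% modification efficiency from the text; the starting amount of 20 nmol oligonucleotide is an assumed example, and the function names are ours.

```python
def hapten_range_nmol(oligo_nmol, excess_low=10, excess_high=50):
    """Amount of NHS- or ITC-hapten (nmol) for a 10-50-fold molar excess."""
    return oligo_nmol * excess_low, oligo_nmol * excess_high

def labelled_range_nmol(oligo_nmol, eff_low=0.40, eff_high=0.90):
    """Expected amount of labelled oligonucleotide (nmol) at 40-90% efficiency."""
    return oligo_nmol * eff_low, oligo_nmol * eff_high

OLIGO_NMOL = 20.0   # assumed amount of 5'-amino oligonucleotide

hapten_low, hapten_high = hapten_range_nmol(OLIGO_NMOL)
product_low, product_high = labelled_range_nmol(OLIGO_NMOL)

print(f"hapten needed: {hapten_low:.0f}-{hapten_high:.0f} nmol")
print(f"labelled oligonucleotide expected: {product_low:.0f}-{product_high:.0f} nmol")
```

For the assumed 20 nmol of oligonucleotide this gives 200-1000 nmol of hapten and 8-18 nmol of labelled product.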

To label unmodified oligonucleotides with biotin 
or other haptens, enzymatic phosphorylation 
followed by reaction with diaminohexane can be 
used [51]. However, enzymatic labelling with 
terminal transferase (TdT) or polymerases, as 
described in Section 24.4.7.3, is simpler and 
preferable. 


24.4.7.3 Enzymatic labelling 

Introduction of haptens by enzymatic reactions 
before or during sequencing reactions provides a 
convenient alternative to chemical labelling, especi- 
ally for primer-walking strategies. A number of 
different approaches have been described: 

1 labelling by TdT and ribonucleotides [25]; 

2 the use of labelled dideoxynucleotides in Sanger 
sequencing protocols [4]; 

3 end-filling reactions for chemical sequencing [32]; 
4 the incorporation of labelled nucleotides into the 
growing DNA chain [24,26,27]. 

Methods 1 and 4 are most generally useful. With 
method 1, a sufficient amount of primer for several 
hundred sequencing reactions can be generated by 
3’-labelling with terminal transferase. However, the 
primer has to be chosen so that the first nucleotide to 
be incorporated into the growing strand is the 
labelled nucleotide. 
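
The primer constraint of method 1 can be checked before a primer is ordered. The sketch below is a minimal illustration with hypothetical helper names and a made-up sequence: it looks up the base that follows the primer on the strand to be synthesized and compares it with the labelled nucleotide. It assumes that the strand is written 5’ to 3’ in the same sense as the primer and that a labelled ribonucleotide is written in the DNA alphabet (U as T).

```python
def first_incorporated_base(product_strand, primer):
    """Return the base that follows the primer on the strand being synthesized."""
    start = product_strand.find(primer)
    if start == -1:
        raise ValueError("primer not found in product strand")
    end = start + len(primer)
    if end >= len(product_strand):
        raise ValueError("no base follows the primer")
    return product_strand[end]

def primer_suits_label(product_strand, primer, labelled_base):
    """True if the 3'-terminal labelled nucleotide would match the template."""
    return first_incorporated_base(product_strand, primer) == labelled_base

# Made-up example sequences, not taken from any real vector.
strand = "GGATCCGTACGTTAGC"
primer = "GGATCCGTAC"

print(first_incorporated_base(strand, primer))      # base following the primer
print(primer_suits_label(strand, primer, "G"))
print(primer_suits_label(strand, primer, "T"))
```

If the check fails, shifting the primer by one or a few bases along the known sequence is usually enough to place the labelled nucleotide correctly.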

Method 4, labelling by incorporation of one 
hapten-modified deoxynucleotide during synthesis, 
has been used successfully for fluorescein [24,52], 
biotin [27], and infrared fluorescence [26]. To obtain 
consistent results, some protocols use a two-step 
reaction. In the labelling step, the absence of at least 
one deoxynucleotide limits the elongation, leading 
to the incorporation of a single labelled nucleotide. 
In the following elongation reaction, an excess of all 
four unmodified dNTPs is used. This approach can 
be used with modified T7 DNA polymerase [27] as 
well as with cycle sequencing protocols [26]. 


24.4.8 Purification of labelled oligonucleotides 


Purification of oligonucleotides labelled with biotin,
digoxigenin or fluorescent dyes can easily be done
by polyacrylamide gel electrophoresis (PAGE) or by
reverse phase HPLC (RP-HPLC). In PAGE
purification, the hapten leads to an apparent size increase
corresponding to 1-2 nucleotides over the
unlabelled oligonucleotide. In RP-HPLC, the
hydrophobic nature of the haptens leads to longer retention
times. An example of an RP-HPLC purification is
shown in Fig. 24.2; protocols for PAGE purification
have been described [20,51].

While purification of often-used primers may be
advisable, it is not necessary in every case. The main
negative effect of incomplete labelling is a
proportionate reduction in signal intensity; with typical
labelling efficiencies of 50% and higher, this can
often be tolerated.


24.4.9 Transfer to membranes 


Before enzyme-linked detection can be done, DNA 
sequence patterns need to be transferred from gels 
onto nylon membranes. This transfer can be done by 
capillary blotting, electrophoretic transfer, or direct 
transfer electrophoresis (DTE). 

The two most convenient methods are contact 
capillary blotting and DTE. In DTE, a membrane is 
moved along the lower edge of the gel during 
electrophoresis; DNA is immobilized on the 
membrane as it leaves the gel. Transfer by DTE is 
virtually complete, and DTE gives longer read 
lengths than conventional gels [19,20,29]. However, 
special sequencing machines are required for DTE. 


Fig. 24.2 Purification of biotin-labelled oligonucleotides
by reverse phase HPLC. A 5’-amino-labelled
oligonucleotide was chemically labelled with
NHS-LC-biotin, and excess biotin was removed by gel filtration as
described in the text. A quarter of the sample (0.25 ml) of
the biotinylated oligo was loaded onto an RP-HPLC
column equilibrated with buffer A (5% acetonitrile, 95%
100 mM triethyl ammonium acetate). After 5 min, a
gradient of 0-40% buffer B (65% acetonitrile, 35% TEA)
over 20 min was started. The biotinylated oligonucleotide
eluted in the last major peak at 21.26 min.
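
The gradient in the legend to Fig. 24.2 can be converted into an approximate solvent composition at the time of elution. The sketch below assumes a strictly linear gradient and ignores the delay between the pump and the column outlet, so the result is a nominal value only; the function names are ours.

```python
def percent_b(time_min, start_min=5.0, duration_min=20.0, b_start=0.0, b_end=40.0):
    """Nominal percentage of buffer B at a given time for a linear gradient."""
    if time_min <= start_min:
        return b_start
    if time_min >= start_min + duration_min:
        return b_end
    return b_start + (b_end - b_start) * (time_min - start_min) / duration_min

def acetonitrile_percent(pct_b, acn_in_a=5.0, acn_in_b=65.0):
    """Acetonitrile content of the mixture of buffer A (5%) and buffer B (65%)."""
    return acn_in_a + (acn_in_b - acn_in_a) * pct_b / 100.0

elution_b = percent_b(21.26)                    # retention time from Fig. 24.2
elution_acn = acetonitrile_percent(elution_b)

print(f"buffer B at elution: {elution_b:.1f}%")
print(f"acetonitrile at elution: {elution_acn:.1f}%")
```

Under these simplifications the biotinylated oligonucleotide elutes at nominally 32.5% buffer B, corresponding to about 24.5% acetonitrile.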


Contact capillary transfer, on the other hand, can 
be done without any special equipment and from all 
typically used gel formats. Before electrophoresis, 
glass plates need to be treated so that they can be 
separated easily after electrophoresis, with the gel 
sticking to only one glass plate. This can be achieved 
by using very clean glass plates, treating both glass 
plates with a hydrophobic solution like Sigmacote or 
Rainex, or by treating one glass plate with Sigmacote 
and the other glass plate with bind silane [9]. 

After electrophoresis, glass plates are pried apart, 
leaving the gel on the lower glass plate. A nylon 
membrane, cut to size and prewetted in electro- 
phoresis buffer (TBE), is placed onto the gel, and air 
bubbles are squeezed out (always wear gloves when 
handling membranes!). Two sheets of dry Whatman 
3MM paper are placed on top of the membrane and 
pressed on to eliminate air bubbles. The other glass 
plate is put on top of the Whatman paper, and a 
weight (2-4kg) is placed on top of this glass plate. 
Transfer times are typically one hour, but shorter 
times and overnight transfers have also been used 
successfully. Before development, membranes are 
crosslinked by UV irradiation at 150 mJ cm⁻². 


24.4.10 Development procedures 


Many protocols for enzyme-linked detection of 
sequencing patterns have been described in detail in 
the literature. Therefore, we will just point out a few 
general considerations for enzyme-linked detection. 
An example protocol which has successfully been 
used for biotin- and digoxigenin-based detection in 
large scale applications [18] is given in Section 24.7. 

Development of sequencing membranes can be 
done in hybridization bags [30], large trays, or 
large cylinders [29,53]. Accordingly, the only equipment 
needed for enzyme-linked detection is either 
a shaker or an instrument to rotate large drums. 
Details of the different procedures have been discussed 
[20]. Each of the three approaches offers 
some advantages, but we have used them interchangeably, 
typically depending on what kind of 
equipment was available. 

Overlaps between different membranes or differ- 
ent parts of one membrane during development are 
avoided in all of the above approaches, whenever 
possible. While rolling up membranes tightly works 
well for radioactive hybridizations, it tends to give 
low signal intensities and background problems 
with enzyme-linked detection protocols. This can be 
explained by several factors, such as limited diffusion 
of enzymes through membrane pores and 
higher concentrations of detection system components 
when compared to radioactive probes. 

One major advantage of large, rotating drums for 
nonradioactive detection is that the wash and 
incubation steps can be easily automated [32]. At 
least one machine for automated nonradioactive 
detection is currently commercially available. With 
chemiluminescent detection, membranes need to be 
taken out of the drums for exposures; with colori- 
metric detection, however, all steps can be done in 
the drums. 

Recently, a variation of the drum approach has 
been described [54]. Membranes are attached to the 
outside of large drums which rotate in slightly 
larger half-cylinders. This method allows for direct 
detection of the sequence patterns by cameras, 
making this scheme very attractive for multiplex 
and other high-throughput projects. 


24.4.11 Adapting sequencing protocols for 
enzyme-linked detection 


Sequencing protocols for radioactive sequencing 
will typically work with minimal or no changes for 
enzyme-linked detection. Sequencing kit protocols 
for end-labelled primers can generally be used 
directly in enzyme-linked detection. Alternatively, 
two-step protocols with incorporation of radioactive 
dNTPs in the first step can also be followed if 
the radioactive nucleotide is replaced by its 
nonradioactive equivalent [30]. 

Slight protocol modifications, however, can 
sometimes be useful to simplify procedures or to 
optimize results. For example, one-step reactions 
instead of two-step reactions can be used [29]. If 
nucleotide mixes are changed, higher dNTP to 
ddNTP ratios are often used to even out band inten- 
sities for long reads, especially if direct transfer 
electrophoresis is used to generate the sequence 
patterns [20]. 


24.5 Fluorescent detection 


24.5.1 Systems for on-line detection 
during electrophoresis 


A number of machines for fluorescent DNA 
sequencing, based on the detection of fluorescently 
labelled DNA molecules during electrophoresis, 
are currently available. These machines eliminate 
manual steps for the visualization of sequence 
patterns completely, and they offer software for the 
automatic reading of sequence patterns, thereby 
potentially increasing the efficiency of DNA 
sequencing. Furthermore, they simplify the scaling 
up of sequencing projects and are often the method 
of choice for large-scale projects. 


However, the costs of automated DNA sequencers 
as well as operating costs may be prohibitive for 
occasional sequencing needs. This may change in 
the future when the use of automated DNA 
sequencers in central facilities, similar to the 
use of oligonucleotide synthesizers and protein 
sequencers, is likely to become more common. We 
will therefore discuss some aspects of automated 
DNA sequencers briefly. 

Automated sequencers can be divided into single- 
dye [6,55,56] (marketed by Pharmacia, Li-Cor, 
Millipore and others) and four-dye systems [3,4] 
(marketed by Applied Biosystems). Single-dye 
systems use one fluorescent label and four lanes per 
sequencing reaction, similar to radioactive sequen- 
cing. Four-dye systems use different fluorochromes 
for each of the four nucleotides, so four colours 
instead of four lanes are used per sequencing 
reaction. As a result, more sequencing reactions can 
be loaded per gel, currently 36 compared with 10-12 
on single-dye systems. Therefore, four-dye systems 
are typically used when high throughput is 
important, for example in large-scale sequencing 
projects [57]. 

A relatively recent development in fluorescent 
sequencers is the availability of machines which can 
accommodate gels up to 60 cm long. Previously, gels 
around 30 cm long were the only option. Longer gels 
lead to significant increases in electrophoretic 
resolution and read length; typically, more than 700 
bases can be resolved to single base resolution on 60- 
cm gels, compared to 350-400 bases on 30-cm gels 
[20,52]. In addition, the better electrophoretic 
resolution of longer gels can also lead to improved 
base-calling accuracy, in particular for the first 400 
bases. 

Longer read lengths can reduce the number of 
sequencing reactions, templates and walking 
primers needed to sequence a given stretch of DNA, 
thus increasing efficiency and reducing costs. How- 
ever, electrophoresis run times tend to be longer, 
and throughput numbers (in bases per hour) tend 
to be lower for longer gels. 
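
The trade-off between reactions per gel, read length and run time can be made concrete with a little arithmetic. The following is only a sketch: the reactions per gel (36 and 12) and the read lengths (400 and 700 bases) are the figures quoted above, whereas the run times are illustrative assumptions that are not taken from the text and will differ between instruments.

```python
def bases_per_run(reactions_per_gel: int, read_length: int) -> int:
    """Bases read from one gel, ignoring failed lanes and overlaps."""
    return reactions_per_gel * read_length


def bases_per_hour(reactions_per_gel: int, read_length: int, run_hours: float) -> float:
    """Throughput of one machine while a gel is running."""
    return bases_per_run(reactions_per_gel, read_length) / run_hours


# Reactions per gel and read lengths are the figures quoted in the text.
four_dye = bases_per_run(36, 400)
single_dye = bases_per_run(12, 400)

# Run times are illustrative assumptions, NOT values from the text.
short_gel_rate = bases_per_hour(36, 400, run_hours=6.0)
long_gel_rate = bases_per_hour(36, 700, run_hours=14.0)

print(four_dye, single_dye, short_gel_rate, long_gel_rate)
```

With these assumed run times, the longer gel yields more bases per run but fewer bases per hour, which is the pattern described above.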

Labelling procedures for single-dye automated 
DNA sequencers are very similar or identical to 
procedures for biotin and other haptens as described 
above. Chemical as well as enzymatic labelling can 
be used, and primer walking strategies can be pur- 
sued by incorporation of fluorescent deoxynu- 
cleotides into the growing chain (see Section 24.4.7). 
Most of the commercially available instruments use 
visible fluorescence; only one instrument uses 
fluorescence in the near-infrared region [56]. The 
main advantages of using the near-infrared region 
are significantly reduced background fluorescence 
and the availability of cost-efficient, highly reliable 
lasers and detectors. 

Multiple dye sequencing machines, on the other 
hand, are often preferred when maximum through- 
put is important. However, labelling strategies are 
more complicated, since four different dyes are 
used, and mobility differences between the dyes can 
have a negative impact on results if new primers are 
chosen. These problems can be circumvented by the 
use of dye-labelled dideoxynucleotides, but reagent 
costs can be higher than for incorporation strategies 
with a single label. 


24.5.2 Detection after electrophoresis: 
fluorescence scanning 


While fluorescence detection during electrophoresis 
is convenient, the throughput is limited by the speed 
of electrophoresis. Typically, just one or two runs per 
day can be obtained per machine. Scanning after 
electrophoresis, on the other hand, can be signifi- 
cantly faster. Most film scanners, for example, scan 
35 × 43 cm X-ray films in less than 10 min. Therefore, 
fluorescence scanners could be an attractive alter- 
native, especially in situations where equipment 
resources are shared between different groups, and 
when equipment funds are limited. 

The use of a fluorescence scanner for DNA 
sequencing has been described [58]. A 532-nm laser 
was used for excitation, and glass plates for 
sequencing gels had to be nonfluorescent. For de- 
tection from membranes, a detection limit of 10 
fmol was reported. It has been observed that high 
intrinsic fluorescence of membranes can limit 
detection sensitivity [59]. 

As mentioned above, background fluorescence is 
much lower in the near-infrared region of the 
spectrum. Detection limits of 60 attomoles have been 
obtained with an infrared scanner prototype [16]. 
This sensitivity is two orders of magnitude better 
than in the visible region, and is sufficient to allow 
detection of DNA sequence patterns from nylon 
membranes. This, in turn, allows for the use of DTE, 
enabling longer reads than with conventional gels. 
An example is shown in Fig. 24.3. The sensitivity is 
also sufficient to allow the detection of multiplex 
sequence patterns with infrared-labelled hybridiza- 
tion probes [16]. However, general use of these 
methods is limited at the time of this writing since 
infrared scanners are not yet commercially 
available. We also encountered problems with the 
reproducibility of infrared fluorescence detection 
from membranes, which were tentatively attributed 
to fluorescence quenching by minute amounts of 
contaminants. 
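
The 'two orders of magnitude' comparison can be checked directly from the two detection limits quoted in this section (10 fmol for the visible-light scanner, 60 amol for the infrared prototype). The short sketch below does nothing more than this unit conversion; the variable names are ours.

```python
import math

FMOL = 1e-15  # mol
AMOL = 1e-18  # mol

visible_limit = 10 * FMOL   # membrane detection limit reported for the 532-nm scanner
infrared_limit = 60 * AMOL  # detection limit reported for the infrared prototype

ratio = visible_limit / infrared_limit
orders_of_magnitude = math.log10(ratio)

print(f"sensitivity gain: {ratio:.0f}-fold (about 10^{orders_of_magnitude:.1f})")
```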


Fig. 24.3 Detection of DNA sequence patterns by 
fluorescence end-labelling and scanning. Sequencing was 
done from recombinant M13-clones with modified T7 
DNA polymerase and manganese buffers, as described 
[20]. The primer was 5’-labelled with IRD40 (Li-Cor, 
Lincoln, Nebraska). Reactions were separated by direct 
transfer electrophoresis onto Biodyne A nylon 
membranes. After drying, the membrane was 
sandwiched between two glass plates and scanned with 
a prototype scanner [16] based on the detection optics of 
a Li-Cor DNA sequencer [56]. 


24.6 Silver staining 


Silver staining of DNA in polyacrylamide gels can 
be sufficiently sensitive to visualize DNA sequence 
patterns [9,10]. No special labelling of primers is 
required, and costs for chemicals are lower than for 
many enzyme-linked procedures. Covalent fixation 
of the gel to one glass plate is required, and the 
overall procedure takes about the same time as 
enzyme-linked developments, thus being faster 
than radioactive detection. Multiple exposures can 
be obtained by a procedure similar to the taking of 
contact prints from black-and-white negatives. 

However, the amount of DNA per sequence band 
is very close to the detection limit of silver staining 
procedures. The supplier suggests cycle sequencing 
protocols which tend to give intense sequence 
patterns, and results may not be satisfactory with 
weak sequences, especially close to the primer. For 
comparison, the detection sensitivity for chemilumi- 
nescent detection is much higher than required for 
most sequence patterns, and patterns with less than 
2% of the typical intensity can be visualized [20]. 


24.7 Hybridization-based detection 
and multiplex sequencing 


Conceptually, multiplex sequencing [8] is very 
similar to DNA sequencing with enzyme-linked 
detection. In enzyme-linked detection, the sequence 
is labelled with a hapten such as digoxigenin, and 
detected through incubation with a reagent that 
specifically recognizes the hapten, for example an 
antibody-enzyme conjugate; in multiplex sequencing, 
a short ‘sequence tag’ is used instead of a 
hapten, and specific recognition is achieved by 
hybridization with a complementary DNA probe instead of the 
antibody-enzyme conjugate. In the simplest case, the sequence 
tag can be a universal primer, and the probe the 
reverse complement of the primer, conjugated to an 
enzyme such as alkaline phosphatase. Then, almost 
identical protocols can be used to visualize the 
sequence pattern, as Table 24.2 shows. 

All incubation and wash steps are done at room 
temperature, and identical buffers are used for steps 
3-9 (see Table 24.2). For the hybridization buffer, a 
high salt concentration is used for maximum speed. 
Alternatively, buffers containing 5-7% SDS [7,20] 
and hybridization times of 30-60 min can be used if 
background minimization is more important than 
short hybridization times. The time for steps 1 and 2 
may be increased to 45 min for digoxigenin, and 
reduced to 15 min for oligo-enzyme conjugates. 
Additional wash steps may be added if the background 
is high. Buffers used are: 

• steps 1 and 2, digoxigenin: 1.5% casein-based 
blocking reagent (Boehringer Mannheim) in maleate 
buffer (1.16% w/v maleic acid, 0.876% w/v NaCl, 
pH adjusted to 7.5 with NaOH); 

• steps 1 and 2, oligo-enzyme conjugate: 2% casein-based 
blocking reagent (Boehringer Mannheim) in 
750 mM NaCl, 50 mM Tris-HCl (pH 8); 

• steps 3-5: maleate buffer, pH 7.5 (see step 1); 

• steps 6-8: 0.1 M diethanolamine-HCl, pH 9.5, 1 mM 
MgCl2. 

While hybridization-based detection can be 
efficiently used for the detection of single, non- 
multiplexed sequencing patterns, higher efficiency 
can be gained by combining sequencing reactions 
with different tags before electrophoresis, and 
reading out the individual patterns through succes- 
sive hybridizations. Many different approaches can 
be used to introduce the multiplex sequence tags: 
reactions done with different primers can be 
combined, primers which differ only in a 5’-tag 
sequence can be combined [60], or the DNA of 
interest can be subcloned so that the insert is flanked 
by different tag sequences, either by using dedicated 
multiplex vectors [8], or by using tagged linkers for 
subcloning [61]. 

Table 24.2 Comparison of protocols for hapten-based and hybridization-based detection. 

Step   Digoxigenin                              Oligo-enzyme conjugate 
1      Block 30 min                             Prehybridize 30 min 
2      Incubate with 30 ml antibody-alkaline    Hybridize with 30 ml oligo-enzyme 
       phosphatase complex (1:5000), 30 min     (1 nM) for 30 min at room temperature 
3-5    Wash 3 × 10 min, 250 ml                  Wash 3 × 10 min, 250 ml 
6-7    Wash 2 × 5 min, 250 ml                   Wash 2 × 5 min, 250 ml 
8      Incubate with 30 ml substrate            Incubate with 30 ml substrate 
       (CSPD 0.05 mM) for 5 min                 (CSPD 0.05 mM) for 5 min 
9      Expose to X-ray film for 10-60 min       Expose to X-ray film for 10-60 min 


The more clones are pooled together (in other 
words, the higher the multiplex factor), the more 
efficiency is gained by multiplexing. Similarly, 
maximum efficiency gains can be realized when 
pooling is done early in the protocol. To give an 
example, pooling clones from 20 different libraries 
before the DNA preparation [8] is much more 
efficient than pooling four sequencing reactions just 
before electrophoresis [60]. 
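
The effect of the multiplex factor and of the pooling stage can be illustrated with a simple count of how often each manual step has to be carried out. This is a deliberately simplified bookkeeping model of our own (one DNA preparation, one set of sequencing reactions and one set of gel lanes per pool), not a description of any particular published protocol; the total of 240 clones is an arbitrary illustration.

```python
from math import ceil


def workload(n_clones: int, multiplex_factor: int, pool_before_prep: bool) -> dict:
    """Count how often each step is performed for n_clones.

    pool_before_prep=True  : clones are pooled before DNA preparation.
    pool_before_prep=False : finished reactions are pooled just before the gel.
    """
    pools = ceil(n_clones / multiplex_factor)
    return {
        "dna_preps": pools if pool_before_prep else n_clones,
        "reaction_sets": pools if pool_before_prep else n_clones,
        "gel_lane_sets": pools,
        # every clone still needs its own readout (one probing per tag)
        "patterns_to_read": n_clones,
    }


early = workload(240, 20, pool_before_prep=True)   # 20-fold, pooled early
late = workload(240, 4, pool_before_prep=False)    # 4-fold, pooled late
print(early)
print(late)
```

In this model, early 20-fold pooling reduces DNA preparations, reaction sets and lane sets alike, whereas late four-fold pooling only reduces the number of lane sets.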

Multiplex sequencing can be done in conjunction 
with all commonly used labelling and detection 
methods: radioactive [8], enzyme-linked [13,15,31,60], 
and, at least in principle, fluorescent detection 
[16]. Multiplex sequencing has been used in 
various large-scale sequencing projects [62], for 
example to obtain more than 760 000 bases of 
sequence from the genomes of Mycobacterium leprae 
and M. tuberculosis [63]. 

While multiplex sequencing is extremely well 
suited for large-scale sequencing projects, 
hybridization-based detection can also be the 
method of choice for smaller projects, since 
protocols for hybridization with oligonucleotide-enzyme 
conjugates are as convenient as other 
enzyme-linked detection protocols. A number of 
protocols to conjugate enzymes to oligonucleotides 
have been described [43-46], and numerous 
refinements of the original protocols have been 
developed. Protocol 123 describes a time-efficient 
conjugation and purification method. The protocol 
can be done simultaneously for 2-4 conjugates in 
less than 2 h, excluding purification. Yields are 
typically between 10 and 20%, enough for 10-20 full-size 
sequencing membranes. Conjugates are stable 
for at least 6 months when stored at 4°C. 

To summarize, developments in the past 10 years 
have given scientists the opportunity to choose 
between a variety of different labelling/detection 
methods for DNA sequencing. These include 
different radioactive isotopes, automated fluorescent 
DNA sequencing machines, and a variety of 
enzyme-linked protocols with colorimetric, chemiluminescent, 
and fluorogenic substrates. Each of 
these methods has been applied to large-scale as 
well as small-scale sequencing projects. Kits and 
machines are commercially available from a number 
of different sources, giving the potential user the 
flexibility to adapt to local restrictions and personal 
preferences. 

A number of other factors which may influence 
the decision for or against a given system have been 
outlined in this chapter, and a few example 
protocols have been given. Admittedly, the views 
presented might be biased by the author’s long 
positive experience with enzyme-linked detection 
and multiplex sequencing. If this is so, it may serve 
to counteract a very common reason for choosing 
radioactive labelling: the ‘we have always done it 
this way’ syndrome. 
Protocol 122 


Labelling oligonucleotides with NHS- or ITC-haptens 


For details of solutions, media and materials, see Appendix I. For 
suppliers and contact addresses see Appendix III. 


Materials 


• NHS-LC-biotin (40 mM) (Pierce) 
• dimethylformamide 
• sodium carbonate (1 M, pH 9.0) 
• Tris-HCl (0.1 M, pH 8.0) 
• Sephadex G25 column 
• buffer: 100 mM triethylammonium acetate, pH 7 (Applied 
Biosystems), or TE (10 mM Tris-HCl, pH 8.0, 0.1 mM EDTA) 


Method 


1 Resuspend deprotected and lyophilized, amino-modified 
oligonucleotide in water to a nominal concentration of 2 mM (100 µl 
for a 0.2 µmol synthesis; do not resuspend in Tris- or other amine- 
buffers!). 


2 Prepare a fresh 40 mM solution of NHS-LC-biotin in 
dimethylformamide (DMF; dimethyl sulphoxide can also be used; use 
high-quality, water-free reagents). 


3 In an Eppendorf tube, mix: 45 µl oligonucleotide, 5 µl sodium 
carbonate, and 50 µl DMF. 


4 Add 100 µl NHS-LC-biotin solution. A precipitate may form, but can 
be ignored. Incubate at room temperature for 2-24 h (shorter 
incubation times can be used, but labelling efficiencies may be 
reduced). For fluorescent haptens, incubate in a dark place and 
minimize exposure to light in all steps. 


5 Add 300 µl 0.1 M Tris-HCl (pH 8.0), and incubate for 10-60 min. 


6 Purify on a Sephadex G25 column (NAP5 column, Pharmacia). If 
purification by reverse phase-HPLC or polyacrylamide electrophoresis 
is intended, use 100 mM triethylammonium acetate (pH 7.0) as buffer 
and lyophilize eluate. For direct use without further purification, use 
TE as buffer. 


Protocol 123 


Preparation and purification of 
oligonucleotide-alkaline phosphatase conjugates 


For details of solutions, media and materials, see Appendix I. For 
suppliers and contact addresses see Appendix III. 


Materials 


• sodium carbonate (1 M, pH 9.0) 
• DMSO 
• disuccinimidyl suberate (DSS), 25 mg ml⁻¹ in DMSO (Pierce) 
• calf intestinal alkaline phosphatase (CIP), 10 mg ml⁻¹ in 3 M NaCl, 1 mM 
MgCl2, 0.1 mM ZnCl2, 30 mM triethanolamine (pH 7.6) (Boehringer 
Mannheim) 
• Sephadex spin columns (Centri-Sep, Princeton Separations) 
• ProteinPAK 300 SW column (Millipore/Waters) 


Method 


1 Dissolve deprotected, lyophilized and amino-modified 
oligonucleotide in water to a concentration of 2 mM. 


2 To 8 µl oligo (16 nmol), add 2 µl sodium carbonate (1 M, pH 9.0) and 
10 µl DMSO. 


3 Add 10 µl of a fresh solution of DSS to each oligo mix, and incubate 
for 5 min. 


4 Add 50 µl water, mix by vortexing. This precipitates most of the 
excess DSS. Spin at maximum speed in an Eppendorf centrifuge for 
2 min. 


5 Carefully take off 60 µl and load onto Sephadex spin columns pre-equilibrated 
in sodium carbonate according to the manufacturer's 
instructions. Spin for 3 min at 4000 r.p.m. Recovery of 20-mers is 
about 40%. 


6 Add 25 µl CIP to each oligo, mix by pipetting; incubate at room 
temperature for 2-16 h, then store at 4°C until purification. 


7 Purify conjugate by gel filtration on a ProteinPAK 300 SW column 
using 20 mM Tris-HCl, 100 mM NaCl as running buffer. Characterize 
fractions by OD260/280 ratios. Conjugate peaks may overlap with free 
enzyme for oligonucleotides shorter than 20 bases, but good 
separation from free oligonucleotides is generally achieved. 
Alternatively, gel filtration on Biogel P60, ion-exchange 
chromatography, or other methods can be used [45,46]. 
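

The amounts quoted in Protocols 122 and 123 can be cross-checked with a few lines of arithmetic, which is useful when a protocol is scaled up or down. The sketch below uses only volumes and concentrations stated in the two protocols; the helper names are ours, and the 'molar excess' is simply the ratio of label to oligonucleotide in the reaction mix.

```python
def conc_mM(amount_umol: float, volume_ul: float) -> float:
    """Concentration (mM) of amount_umol dissolved in volume_ul."""
    return amount_umol / volume_ul * 1000.0


def amount_nmol(volume_ul: float, concentration_mM: float) -> float:
    """Amount (nmol) contained in volume_ul of a solution; µl x mM = nmol."""
    return volume_ul * concentration_mM


# Protocol 122, step 1: a 0.2 µmol synthesis resuspended in 100 µl
stock = conc_mM(0.2, 100)

# Protocol 122, steps 3 and 4: 45 µl oligo (2 mM) plus 100 µl NHS-LC-biotin (40 mM)
oligo_122 = amount_nmol(45, 2)
biotin_122 = amount_nmol(100, 40)
molar_excess = biotin_122 / oligo_122

# Protocol 123, step 2: 8 µl oligo (2 mM)
oligo_123 = amount_nmol(8, 2)

# Protocol 123, steps 2-5: fraction of the 80 µl mix loaded on the spin column
loaded_fraction = 60 / (8 + 2 + 10 + 10 + 50)

print(stock, oligo_122, biotin_122, round(molar_excess, 1), oligo_123, loaded_fraction)
```

The 16 nmol obtained for Protocol 123 agrees with the value given in step 2 of that protocol.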
Chapter 25 Repeat analysis 


25.1 Introduction 


Repetitive elements are DNA fragments that occur 
repeatedly within a genome. This broadest defini- 
tion includes gene families such as rRNA and tRNA. 
A more limited definition [1] includes tandem arrays 
of repeats (satellites, microsatellites, and minisatellites), 
telomeric and subtelomeric repeats, retroposons 
(short interspersed repetitive elements such 
as Alu and long interspersed repetitive elements such as L1), 
medium and low reiteration frequency 
repeats, endogenous retroviruses, and viral retrotransposons. 
As much as 60% of the human genome 
may consist of repetitive elements, even if the less 
inclusive definition is adopted [1]. 

The REPBASE database [2] contains prototypical 
interspersed repetitive elements from primates, 
rodents, mammals, vertebrates, invertebrates, and 
plants, as well as a collection of prototypical simple 
DNA sequences of Alu, L1, MIR and THE repetitive 
elements. REPBASE is available in directory 
‘repository/repbase’ via ftp from ncbi.nlm.nih.gov. 

With the advent of large-scale DNA sequencing, 
the analysis of the newly sequenced DNA for the 
presence of repetitive elements is becoming a 
frequent practice. Basic repeat analysis consists of 
the following three steps (we here assume human 
DNA is being analysed). 

1 Recognition of known repeats. Occurrences of 
known interspersed repetitive elements are dis- 
covered. The DNA sequence is separated into repeat 
and nonrepeat regions. 

2 Repeat subfamily identification. Alu sequences, and 
perhaps other repeats, discovered in the first step are 
assigned to subfamilies. 

3 Recognition of internal repeats. Nonrepeat regions 
that are produced in the first step are analysed for 
local or global repetitions that are either direct or 
inverted. 


The non-repeat regions that remain after the third 
step are used for comparisons against DNA or 
protein data bases in search of biologically intere- 
sting similarities. 
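
The separation of a sequence into repeat and nonrepeat regions (step 1) amounts to taking the complement of a set of intervals. The following minimal sketch assumes 1-based, inclusive coordinates and uses two of the Alu hits reported for the HUMTPA example analysed in Section 25.2; it is our own illustration and not the code used by PYTHIA.

```python
def nonrepeat_regions(seq_length, repeats):
    """Return the (start, end) stretches not covered by any repeat.

    repeats: iterable of (start, end) pairs, 1-based and inclusive;
    overlapping or unsorted hits are tolerated.
    """
    regions = []
    cursor = 1  # first position not yet assigned to a region
    for start, end in sorted(repeats):
        if start > cursor:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= seq_length:
        regions.append((cursor, seq_length))
    return regions


alu_hits = [(8862, 9165), (739, 1022)]
print(nonrepeat_regions(36594, alu_hits))
```

The first two regions returned, 1-738 and 1023-8861, correspond to the nonrepeat fragments listed for this locus in Fig. 25.5.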

In the following section we follow the three-step 
protocol using PYTHIA programs and REPBASE 
data base. We then review other methods and 
programs for individual steps of the protocol. In the 
appendix to this chapter, we discuss in more detail 
SMPL, a core program of PYTHIA. While we focus 
on the analysis of human genomic sequences, the 
analysis of sequences of other organisms may be 
performed similarly. 


25.2 Repeat analysis via PYTHIA 


PYTHIA is currently available on an electronic mail 
server at the address pythia@anl.gov. To get an 
update on the current status of the server, send it a 
message with the word ‘help’ in the subject line. 

In the following we describe in detail the three- 
step protocol for repeat analysis using the human 
tissue plasminogen activator gene as an example. 

The body of an e-mail message containing a request 
to PYTHIA consists of sequences in Intelligenetics 
format. An example of input is in Fig.25.1. Although 
in our example we assume that a single locus is 
analysed, it should be noted that PYTHIA accepts 
multiple loci as well. The subject line contains one of 
the keywords: ‘RPTS’ (recognition of known repeats), 
‘ALU’ (Alu subfamily identification), or ‘SMPL’ 
(recognition of internal repeats). 


25.2.1 RPTS: recognition of known repeats 


HUMREP is a file within REPBASE that contains 
prototypical human interspersed repeats. A version 
of HUMREP that is augmented by repeats in opposite 
orientation can be obtained by sending the word 
‘repbase’ in the subject line. The names of repeats in 
opposite orientation are derived by appending ‘c’ at 
the end of the original name (e.g. an L1 sequence in 
opposite orientation is denoted L1c). 


; HUMTPA Length: 36594 June 4, 1994 19:29 Type: N Check: 8313 
HUMTPA 
CACACAACTGGTGCTGTTACCACCATGGGCGTCTAGTCTGGATCAGTGGTCCTCAGTCTTTTTTGCAC 
CAGGGACCAGTTTTGTAAAGATAGCTTTTCCACGGACAGAGGGAGGGGAGATAGTTTCGGGATGATTCAA 
... 
AAAGGGCAGACTTGTAGTAGAATTCAGTTGCAAGAGGGATTGGGGAATCTTAAGGAAAAAATAGAATCTT 
AAGGAAAAAATAACTGGGTGAGACGTGGACTGTCGACAGGCGCGGAAAAGGCAC1 


Fig. 25.1 Human tissue plasminogen activator sequence 
in Intelligenetics format. Only the first two and the last 
two lines of sequence are shown here. Intelligenetics 
format requires that every sequence be preceded with at 
least one line starting with a semicolon (the rest of the 
letters in these lines do not matter) and by a locus name 
consisting of a contiguous sequence of letters at the 
beginning of a separate line. The sequence itself consists 
of uppercase letters A, G, C, and T and ends with a 1. 
Blank lines are not allowed, but empty spaces within the 
sequence itself are tolerated. The maximum length of a 
line accepted by PYTHIA programs is 99 letters. Every 
request consists of one or more sequences. 
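

The format rules given in the legend of Fig. 25.1 are simple enough to be applied by a short script. The sketch below is an illustration written for this purpose, not part of PYTHIA: the function name, the default line width of 70 letters and the toy sequence are our assumptions; only the rules themselves (semicolon line, locus name, uppercase A, G, C and T, terminating 1, lines of at most 99 letters, no blank lines) come from the legend.

```python
def to_intelligenetics(locus, sequence, comment="", width=70):
    """Format one sequence following the rules in the legend of Fig. 25.1."""
    if width > 99:
        raise ValueError("lines longer than 99 letters are not accepted")
    seq = "".join(sequence.split()).upper()  # drop spaces and line breaks
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must consist of A, C, G and T only")
    body = seq + "1"  # the sequence ends with a 1
    lines = [";" + (" " + comment if comment else ""), locus]
    lines += [body[i:i + width] for i in range(0, len(body), width)]
    return "\n".join(lines)  # no blank lines anywhere


message_body = to_intelligenetics("DEMO", "acgt acgt", comment="toy example", width=5)
print(message_body)
```

Several sequences can be placed in one request by concatenating the formatted entries, each on its own lines.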

In order to search sequences against HUMREP, 
send them to pythia@anl.gov with the term ‘RPTS’ in 
the subject line. As an example, in the following we 
assume that we have sent as input the human 
tissue plasminogen activator (HUMTPA) gene sequence 
[3], GenBank [4] accession number K03021, 
containing 36594 bases, as illustrated in Fig. 25.1. 
PYTHIA responds with a message consisting of four 
parts, as described below. 

1 Occurrences of repeats. A partial listing for our 
example is in Fig. 25.2. 

2 Listing of Alu regions. A few Alu sequences from 
our example are shown in Fig. 25.3. These Alu 
regions can be excised using a text editor and then 
sent directly back to pythia@anl.gov with ‘ALU’ in 
the subject line for Alu subfamily identification, an 
analysis option described in Section 25.2.2. 

3 Local alignment of repeats. A local alignment of a 
repeat fragment from our example is shown in 
Fig. 25.4. There is no rigorous significance theory of 
alignment scores yet; a threshold for significant 
alignment scores is set based on empirical testing. 
An additional test of homology, suggested by J. Jurka 
[5], is to compute the ratio of transitions to total point 
mutations; a ratio significantly above 1:3, the 
expected value for random sequences, indicates true 
homology. 

4 Listing of nonrepeat fragments. Some of the nonrepeat 
fragments from our example are in Fig. 25.5. 
The sequence returned by PYTHIA is formatted so 
that it can be immediately mailed to pythia@anl.gov 
with ‘SMPL’ in the subject line for the purpose of 
discovering internal repetitive patterns, or it can be 
directly used for database searches. Searching for 
internal repetitive patterns is recommended because 
it may further reduce uninteresting similarities (e.g. 
see Fig. 25.6) produced by the standard similarity 
search algorithms. 


ALU : 
HUMTPA 739 1022 
HUMTPA 8862 9165 


HUMTPA 32921 33210 
HUMTPA 34234 34503 


HUMTPA 5671 5960 
HUMTPA 6319 6463 
HUMTPA 6483 6746 
HUMTPA 7224 7512 
HUMTPA 10513 10938 
HUMTPA 11728 11869 
HUMTPA 12700 13143 
HUMTPA 18640 18796 
HUMTPA 21651 21940 


MERI : 
HUMTPA 40 578 
HUMTPA 26355 26380 


MERI2c : 
HUMTPA 17555 17769 


MERIic : 
HUMTPA 301 408 


Fig. 25.2 Some occurrences of repeats in the human tissue 
plasminogen activator sequence that are identified by 
PYTHIA. 
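

The transition test mentioned under item 3 is easy to apply to any pairwise alignment. A transition exchanges two purines (A, G) or two pyrimidines (C, T); since each base can change into three others and only one of these changes is a transition, about one mismatch in three is expected to be a transition between unrelated sequences. The sketch below counts both quantities for two aligned rows of equal length; the function names and the toy alignment are ours.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a: str, b: str) -> bool:
    """True for A<->G or C<->T exchanges."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def transition_counts(top: str, bottom: str):
    """Return (transitions, mismatches) for two aligned rows.

    Gap columns and non-ACGT symbols are skipped.
    """
    transitions = mismatches = 0
    for a, b in zip(top.upper(), bottom.upper()):
        if a not in "ACGT" or b not in "ACGT" or a == b:
            continue
        mismatches += 1
        if is_transition(a, b):
            transitions += 1
    return transitions, mismatches


counts = transition_counts("ACGTAC-GA", "GCGCATTTA")
print(counts)

# Values reported in the header of the alignment in Fig. 25.4
fig_25_4_ratio = 25 / 39
print(round(fig_25_4_ratio, 2), fig_25_4_ratio > 1 / 3)
```

For the alignment in Fig. 25.4, 25 of 39 mismatches are transitions, a ratio of about 0.64 and thus well above 1:3.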
HUMTPA[739->1022] (0,0) 
GGCTGGGCGCGGTGGCTCACACCTATAATCCCAGCACTTTGGGAGGCTGAGGCAGGTGGATCACGAGGTC 
GGGGGTTTGAGACCAGCCTGACCAACATGGTGAAACCCCGTCTCTACTAAAATACAAAAAATTAGCTGGG 
CGTGGTGGCGGGCACCTGTAATCTCAGCTACTCAGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGTGG 
AGGTTGCAGTGAGCCGAGATCACACCACTGCACTCTAGCCTGGGCGACAGAGCAAGACTCTGTCTCAAAA 
AAAA1 
HUMTPA[8862->9165] (0,0) 
GGCCGGGCACACAGCTCCTGCCTGTAATCCCAGCACTTTGGGAGCCCGAGGTGGGCGGGTTGCTTGAGCC 
AAGGAGTTTGAAACCAGCCCGGGTCTTGAACATAGCGAAGACTCTGTCTCTACAAAAAAATGAAAAAAAA 
AAAAAAATTAGCCAGACATGGTGGCACGCACCTGTAGTCCCAGCTACTTGAGAAGCTGAGGTGAGAGGAT 
CACTTGAGCCAGGGAGGTTGAAACTGCAGTGAGCTGTGATCACGCCACTGCACTCCAGTCTGGGTGACTG 
GGCGAGACCCTGTCTCAAAACAAA1 


Fig. 25.3 Partial listing of Alu regions. 
score: -44 
top: locus: ALUc beginning: 150 end: 290 length: 141 
bottom: locus: HUMTPA beginning: 11728 end: 11869 length: 142 
local_indels: 1 mismatches: 39 transitions: 25 

@160 @170 @180 @190 @200 @210 
CCAGG AAAATATTCTATTCTTTTGAAGACATGGGGTCTTGCTATGTTGCCTAGGCTGGTCTTGAACTCCT 
@11730 @11740 @11750 @11760 @11770 @11780 @11790 

@220 @230 @240 @250 @260 @270 @280 
GACCTCAGGTGATCCGCCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGacccaG 
CGCCTCAAGTGATGCTCCTGCCTCAGCCTCCTGAGTAGCTAGGACTACAGGTGCAAACCACCACACCCAG 
@11800 @11810 @11820 @11830 @11840 @11850 @11860 


Fig. 25.4 A local alignment of a repeat region. Note that 
the ratio of transitions to total point mutations 
significantly exceeds the expected value of 1:3, thus 
providing additional evidence for true homology. 


HUMTPA[1->738] (0,0) 
| PICACACAACTGGTGCTGTTACCACCATGGGCGTCTAGTCTGGATCAGTGGTCCTCAGTCTTTTTTGCAC 
| CACGGACCAGTTTTGTAAAGATAGCTTTTCCACGGACAGAGGGAGGGGAGATAGTTTCGGGATGATTCAA 
HUMTPA[1023->8861] (0,0) 
MAMA GARAATARBARAABAAAAACAAGTTTCTTGCCCACTCTICCTTTCTCTGAGTTTCCAGAGACAT 
CACATCATTTCTTACCCAGCTGAGCAGAGTCCCAGCATGECTCTCGTTCGAATACCCATCCTGCCACCTG 


Fig. 25.5 A sample of nonrepeat fragments. Loci names 
are augmented by fragment location: HUMTPA[1023->8861] (0,0) 
denotes subfragment 1023-8861 within 
locus HUMTPA in direct orientation; the first 0 within 
the parentheses means that the sequence is not reversed 
(1 means reversed) while the second 0 means that the 
sequence is not complemented (1 means complemented). 
The same fragment in opposite orientation would be 
denoted HUMTPA[1023->8861] (1,1). 
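
The fragment labels explained in the legend of Fig. 25.5 can be decoded mechanically. The following sketch parses a label of the form LOCUS[start->end] (r,c) and cuts the corresponding fragment out of a sequence, applying reversal and complementation as indicated by the two flags. The regular expression, the function names and the toy sequence are our own assumptions about how such labels might be handled; they are not taken from PYTHIA.

```python
import re

LABEL = re.compile(r"^(\w+)\s*\[(\d+)\s*->\s*(\d+)\]\s*\(([01]),\s*([01])\)$")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_label(label: str):
    """Split 'HUMTPA[1023->8861] (0,0)' into its five components."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized fragment label: {label!r}")
    locus, start, end, rev, comp = match.groups()
    return locus, int(start), int(end), rev == "1", comp == "1"


def extract(sequence: str, start: int, end: int, is_reversed: bool, is_complemented: bool) -> str:
    """Cut out positions start..end (1-based, inclusive) and orient the fragment."""
    fragment = sequence[start - 1:end]
    if is_reversed:
        fragment = fragment[::-1]
    if is_complemented:
        fragment = fragment.translate(COMPLEMENT)
    return fragment


print(parse_label("HUMTPA[1023->8861] (0,0)"))
print(extract("AAACCG", 2, 5, True, True))  # reverse complement of AACC
```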

25.2.2 ALU: subfamily identification of 
Alu sequences 


In order to identify subfamily membership of Alu 
sequences, send them with the word ‘ALU’ in the 
subject line. In the following, we assume that the Alu 
sequences obtained in the previous section are sent 
as the input. PYTHIA responds with a message 
consisting of two parts, as described below. 

1 Alignment of Alu sequences against Alu consensus. 
An example of an alignment is shown in Fig. 25.7. 
The alignment enables exact localization of diagnostic 
positions within the Alu sequence. The diagnostic 
positions are important for subfamily 
identification, which is described next. 

2 Alu subfamily identification. The bases in diagnostic 
positions determine Alu subfamily membership. 
The output of PYTHIA describes the identification 
procedure; an example of the output is in Fig. 25.8. 
Alu subfamilies and diagnostic positions are 
described in more detail in ref. 6. The method that 
was used for discovering Alu subfamilies is described 
in ref. 7. 
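
The principle of classification by diagnostic positions can be sketched in a few lines. The example below is purely schematic: the subfamily names, the diagnostic positions and the bases in the table are invented for illustration and do not correspond to real Alu subfamilies (for those, see ref. 6), and the scoring rule (count of matching diagnostic bases) is our assumption, not necessarily the procedure used by PYTHIA.

```python
# HYPOTHETICAL diagnostic table: names, positions and bases are invented.
DIAGNOSTIC = {
    "subfamily_X": {3: "A", 7: "C"},
    "subfamily_Y": {3: "G", 7: "T"},
}


def bases_at_consensus_positions(aligned_consensus: str, aligned_query: str) -> dict:
    """Map each consensus position (1-based, gaps not counted) to the query base."""
    base_at = {}
    position = 0
    for c, q in zip(aligned_consensus, aligned_query):
        if c == "-":  # insertion in the query: no consensus position consumed
            continue
        position += 1
        base_at[position] = q
    return base_at


def classify(aligned_consensus: str, aligned_query: str, table: dict):
    """Return the best-matching subfamily and the score of every subfamily."""
    base_at = bases_at_consensus_positions(aligned_consensus, aligned_query)
    scores = {
        name: sum(base_at.get(pos) == base for pos, base in sites.items())
        for name, sites in table.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


result = classify("ACG-TACGT", "ACAATACCT", DIAGNOSTIC)
print(result)
```

The alignment against the consensus is what makes the position numbering meaningful: insertions in the query do not shift the diagnostic positions.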
ATTAACAGTAACTGCTTCATAGATAGA-AGATAGATAGATTAGATAGATAGATAG 
**  ** ***  **  * * ******* **** ***** *  *********** * 
ATAGACGGTAGATGGATGACAGATAGACAGAT-GATAGGT--GATAGATAGAT-G 


Fig. 25.6 Sequence similarity due 
to shared repetitive structure. 


score: 186
top: locus: CONSENSUS beginning: 1 end: 289 length: 289
bottom: locus: HUMTPA[739->1022] (0,0) beginning: 1 end: 283 length: 283
local_indels: 8 mismatches: 23 transitions: 22

Positions 1-70 (top: consensus; bottom: HUMTPA fragment):
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACCTGAGG
GGCTGGGCGCGGTGGCTCACACCTATAATCCCAGCACTTTGGGAGGCTGAGGCAGGTGGATCAC--GAGG

Positions 210-279:
GAGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGT
--GGTGGAGGTTGCAGTGAGCCGAGATCACACCACTGCACTCTAGCCTGGGCGACAGAGCAAGACTCTGT

Positions 280-289:
CTCAAAAAAA
CTCAAAAAAA


Fig. 25.7 An alignment against Alu consensus.


25.2.3 SMPL: recognition of internal repeats 


Even after all occurrences of known repetitive
elements are eliminated from a genomic sequence, 
the sequence may still contain internal repetitions of 
different kinds: satellites, microsatellites, and mini- 
satellites consisting of tandem arrangements of 
short oligomers (see Chapter 5); self-complementary 
sequences; pseudogenes that arose by gene dupli- 
cation; or perhaps two or more occurrences of as yet 
undiscovered repetitive elements in the same or in 
the opposite orientation. Such structures can be 
recognized by sending the sequences with the word
‘SMPL’ in the subject line.

The SMPL program, the format of its output, and 
the method for establishing significance are dis- 
cussed in more detail in the appendix to this chapter. 
The output of SMPL consists of five parts, each 
corresponding to a combination of the following 
two criteria: that repetitive structures can be local 
(within a window of 128 bases) or global (perhaps 
even occurring across different loci), and that 
repetitions may occur in the same or in opposite 
orientation. 

1 Simple regions. Regions consisting of repetitions of
words are identified and parsed, as shown in
Fig. 25.9.

2 Listing of complex regions. The sequence that 
remains after the regions identified in step 1 are 
‘censored’ may now be used for standard database 
search. 

3 Local reverse complementarity. Fragments that con- 
tain a significant number of long words in both 
orientations are detected. If a long word is identical 
to its own reverse complement, it may be detected. 
This kind of symmetry may be a simple consequence 
of tandem repetitiveness (e.g. an (AT)*N sequence),
or it may indicate the presence of secondary 
structure in the transcribed RNA or in DNA. 
Regions detected in our example are listed in 
Fig. 25.10. 

4 Global repeats. Repeats can be due to duplications 
of large genomic segments or due to the presence of 
repetitive elements. Figure 25.11 contains the output 
for our example. Note that SMPL can analyse 
multiple loci at once, thus serving as a method for 
fast identification of repetitive structure in data 
bases of DNA sequence. 

5 Global inverted repeats. The list of global inverted repeats in our
example is empty, but if we did not eliminate two occurrences of Alu
sequences in opposite orientation, they would have been detected at
this point. While the occurrences of a repeated region in identical
orientation may be explained by gene duplications, occurrence in
opposite orientation is more likely to indicate the presence of a
repetitive element, which typically integrates in either orientation.
This is how many repeats from REPBASE have been found.


locus contains an Alu-Sx

An Alu sequence is identified by performing
a series of decisions as illustrated
on the right. Each decision leads to
the placement of the sequence in a more
specific subfamily. As an example, the decisions
leading to the identification of an Sb1 sequence
are marked by '#'. (Sb0 denotes the members
of Sb that are neither Sb1 nor Sb2.)

Fig. 25.8 Alu subfamily identification. (The table of diagnostic
positions and weights and the decision-tree diagram of the original
figure are not legible in the source.)


; 41 HUMTPA 7103 7230
HUMTPA[7103=>7230] (0,0) {41}|self
T-C-T-T-C-C-G-A-T-A-G-T-G-G-C-T-C-A-G-T-T-T-T-C-T-A-C-T-T-A-C-A-T-A-A-
A-A-A-G-A-C-A-G-C-A-C-A-T-T-C-T-C-T-T-A-G-C-A-A-T-A-T-G-T-G-T-T-T-G-T-
A-T-G-TGTGTGTGTGTGTGTGTGTGTGTGTGT-A-TATATATATATATATATATATA-A-T-T-T-A1

; 8 HUMTPA 21251 21278
HUMTPA[21251=>21278] (0,0) {8}|self
A-A-G-A-A-A-A-A-G-AAAAGAAAAGAAAA-A-A-T-T-A1

; 42 HUMTPA 23879 24006
HUMTPA[23879=>24006] (0,0) {42}|self
A-A-T-A-C-A-G-G-A-T-G-G-A-T-A-GATGGATAGATG-A-T-A-G-A-C-A-G-A-T-A-ATAGA
TGATAG-G-T-GATAGATGATAGA-T-TGATAGATGATAGAT-GATAGGTGATAGAT-T-A-G-A-T-A-
AATAGATGATA-C-A-T-A-C-ATGATAGAT-A-G-A1

Fig. 25.9 Simple regions. Dashes are inserted to indicate parsing, as
described in the appendix to this chapter. A locus name is augmented by
the redundancy information (the method for computing encoding lengths
and redundancy is explained in the appendix): HUMTPA [7103 => 7230]
(0,0) {41} | self means that the fragment can be encoded in 41 bits
less than required by the straightforward encoding (2 bits per letter),
indicating repetitions of words within the fragment at the significance
level 2^-41.


; 21 HUMTPA 7167 7230 HUMTPA 7167 7230
HUMTPA[7167=>7230] (0,0) {21}|HUMTPA[7167=>7230] (1,1)
G-T-T-T-G-T-A-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-T-G-
TATATATATATATATATATATATA-A-T-T-T-A1

; 21 HUMTPA 7199 7262 HUMTPA 7199 7262
HUMTPA[7199=>7262] (0,0) {21}|HUMTPA[7199=>7262] (1,1)
G-T-G-TATATATATATATATATATATATA-A-T-T-T-A-G-A-G-A-C-A-A-G-G-T-C-T-G-A-C
[remainder of this sequence is illegible in the source]

; 0 HUMTPA 7935 7998 HUMTPA 7935 7998
HUMTPA[7935=>7998] (0,0) {0}|HUMTPA[7935=>7998] (1,1)
A-G-G-A-T-T-G-A-T-C-A-G-A-A-G-A-T-C-T-G-A-T-T-C-C-ACCTGGA-G-C-C-T-C-T-
GAAGTGATCACTTC-C-A-G-G-T-T-A-G-G-C-T-G1

; 2 HUMTPA 27229 27292 HUMTPA 27229 27292
HUMTPA[27229=>27292] (0,0) {2}|HUMTPA[27229=>27292] (1,1)
T-A-C-A-T-A-A-ATATGTGT-G-T-G-G-G-T-G-T-G-T-G-TATATATATA-T-G-T-A-A-T-AC
ACATAT-A-T-TAAATTTA-T-A-T-A1

; 0 HUMTPA 32792 32855 HUMTPA 32792 32855
HUMTPA[32792=>32855] (0,0) {0}|HUMTPA[32792=>32855] (1,1)
[sequence is illegible in the source]

Fig. 25.10 Local reverse complementarity. Locus name is augmented by
the encoding information: HUMTPA [7167 => 7230] (0,0) {21} | HUMTPA
[7167 => 7230] (1,1) means that 21 bits can be saved by encoding the
fragment 7167-7230 relative to its own reverse complement, indicating
reverse complementarity at the significance level 2^-21. The words that
are not interrupted by dashes occur in opposite orientation within the
same region, as determined by the parsing procedure that is described
in the appendix to this chapter.


; 36 HUMTPA 25479 25618 HUMTPA 3071 3326
HUMTPA[25479=>25618] (0,0) {36}|HUMTPA[3071=>3326] (0,0)
ATTCCAG-T-CCACAC-CTTGTCA-A-T-T-T-GGCACC-C-A-TGTGCATC-TCCTT-AAACC-ATCCT
T-CACCTCC-A-A-G-TAAACAC-A-G-G-A-ACAAA-A-T-C-A-T-A-C-TCCTGCCT-A-A-C-A-T
-G-A-TAGAA-CTACC-AGTGT-A-CAACC-A-A-A-A-A-C-G-CACTCCC1

; 84 HUMTPA 24583 24838 HUMTPA 5375 5630
HUMTPA[24583=>24838] (0,0) {84}|HUMTPA[5375=>5630] (0,0)
CAGGA-G-G-TCCTGAGGACAT-G-T-G-C-CCAAGGT-TGTCAG-G-G-C-A-C-A-G-C-T-T-GCCT
TT-A-G-A-C-G-T-T-T-T-AGGGAG-T-CATGAGACAT-C-AATCAA-CATGTG-T-G-AGATGT-A-
C-A-TCGGT-TTGGT-C-G-GGAAAG-T-T-G-G-G-A-T-AACTCGAAG-C-A-A-GGGCTTCCAGG-C
-CATAGGTAGATAAGAGA-C-A-AAAGGC-T-G-T-A-TTCTGAGTC-C-T-T-G-A-TCAGC-T-TTTC
ACTGAA-C-ACACAATT-GAGTCT-G-G-C-T-C-A-G-TTCAT-C-T-G-C-A-T-T-TTTACATA-A-
A-A-A1

Fig. 25.11 Global repeats. Locus name indicates regions compared and
gives relative encoding length: HUMTPA [25479 => 25618] (0,0) {36} |
HUMTPA [3071 => 3326] (0,0) means that 36 bits can be saved by encoding
the fragment 25479-25618 relative to fragment 3071-3326 in direct
orientation, indicating similarity at the significance level 2^-36. The
words that are not interrupted by dashes occur in both fragments and
are determined by the parsing procedure that is described in the
appendix to this chapter.
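
The (reversed, complemented) flags that appear in the locus names of
Figs 25.5 and 25.9-25.11 can be illustrated with a short Python sketch.
The function names are hypothetical and are not part of PYTHIA; the
sketch assumes an upper-case A/C/G/T alphabet.

```python
# Illustrative helpers only (hypothetical names, not PYTHIA code).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def orient(seq, reversed_flag, complemented_flag):
    """Apply the (reversed, complemented) flags of a locus name to seq."""
    if reversed_flag:
        seq = seq[::-1]
    if complemented_flag:
        seq = seq.translate(COMPLEMENT)
    return seq

def is_self_reverse_complement(word):
    """True if a word is identical to its own reverse complement."""
    return orient(word, 1, 1) == word

print(orient("GATTACA", 0, 0))                    # (0,0): direct orientation
print(orient("GATTACA", 1, 1))                    # (1,1): opposite orientation
print(is_self_reverse_complement("ATATATAT"))     # tandem (AT) repeat
print(is_self_reverse_complement("CCTGTACAGG"))   # undashed word in Fig. 25.10
```

Both test words are their own reverse complements, which is why tandem
(AT) repeats show up in the local reverse complementarity listing.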


25.3 Repeat analysis by methods 
other than PYTHIA 


Many methods for repeat analysis have been proposed and implemented.
In the following, we discuss a representative but not comprehensive
sample of such programs, grouping them according
to the three steps of our repeat analysis protocol.


25.3.1 Recognition of known repeats 


The CENSOR program [8] and GenQuest server [9] 
also provide searches against REPBASE. The 
XBLAST program [10] provides a search against a 
data base of Alu sequences. 

For the purposes of comparison, the same se- 
quence that was sent to PYTHIA with ‘RPTS’ in the 
subject line (as described in Section 25.2) was sent to 
the GenQuest server [9]; the response is in Fig. 25.12. 
While all the Alu occurrences in direct orientation 
were correctly identified by both programs, only 
PYTHIA identified all the occurrences of Alu in
reverse orientation. GenQuest also missed a number
of occurrences of other repetitive elements that are 
not listed in Fig. 25.2 due to lack of space. In some 
other cases, GenQuest may also detect elements not 
recognized by PYTHIA. Rather than indicating
superiority of one program over the other, this single
example only illustrates the fact that the best analysis
may be performed by combining the independent
results of a number of programs.

One should mention that occurrences of Alu 
elements can be censored prior to sequence simi- 
larity searches using XBLAST [10]. Unlike PYTHIA 
and GenQuest, which both employ alignment 
algorithms that tolerate insertions and deletions, 
XBLAST is essentially a parser of BLAST output 
and inherits its insensitivity to these two kinds 
of mutations. Since repetitive elements are subject 
to insertions and deletions, the methodological 
disadvantage of the latter method is clear, espe- 
cially in the case of older and more decayed 
elements. 

The sensitivities of CENSOR and RPTS are comparable. However, CENSOR
is more than a hundred times faster than the current inefficient
implementation of RPTS. Preparations are in progress to
either improve RPTS or replace it with CENSOR.

In addition to the methods discussed so far, a
number of other methods and programs for analysis
and visualization of repeats and for large-scale
sequence comparisons have been proposed [11-14].


25.3.2 Repeat subfamily identification 


The only programs other than PYTHIA for identi- 
fying Alu and other repeat subfamilies are available 
from Jerzy Jurka (jurka@gnomic.stanford.edu).


>ALU
len = 290
forward strand
hit 1: [score and positions illegible in the source]
hit 2: 203 8862- 9165
hit 3: 235 32922-33189
hit 4: 186 34234-34464
reverse strand
hit 1: 234 21661-21941
hit 2: 153 12863-13144
hit 3: 167 10680-10939
hit 4: 164 7259- 7514
hit 5: 93 6506- 6747
hit 6: [score illegible in the source] 5695- 5961
>MER1
len = 539
forward strand
hit 1: 539 40- 578
>MER12
len = 240
reverse strand
hit 1: [score illegible in the source] 17556-17786


Fig. 25.12 Some occurrences of repeats in the human
tissue plasminogen activator sequence that are identified
by GenQuest.


25.3.3 Recognition of internal repeats 


Simple sequences that have biased sequence composition can also be
detected by the SEG program [15,16], which is based on the concept of
compositional complexity [17,18], and is currently also available
individually or as the ‘-filter seg’ option in BLAST. Program XNU
[10,19] finds short tandem repeats by performing self-alignment. XNU is
also available as an XBLAST postfilter of the BLASTN program.

SEG [15,16] works well in cases with repetitive pattern bias in the
base frequencies (as with poly(A) tails), but it fails to recognize
longer tandem repeats of balanced base composition. XNU and XBLAST
[10,19] partly rectify this problem by performing self-alignments, a
computationally expensive operation that may discover repeats of longer
periodicity. The SEG and XNU programs have been judged complementary,
each being sensitive to a particular kind of repetitiveness [20]. As
described in the appendix, SMPL requires only a linear time
computation, in contrast to the quadratic pattern, irrespective of its
periodicity. SMPL exactly measures redundancy in terms of the number of
bits, d, which directly implies significance of 2^-d.

In addition to the methods discussed so far, a
number of other methods and programs for analysis
and visualization of internal repetitive structures
have been proposed [21-24].

Finally, we should mention Sequence Landscapes 
[25], a pioneering program for analysis of internal 
repetitive structures, still rarely superseded in its 
generality and clarity. The directed acyclic word 
graph data structure [26] that underlies Sequence 
Landscapes has also been employed in the PYTHIA 
SMPL program. 


Appendix: encoding and parsing


In the following we discuss in more detail the 
method employed by SMPL. We also present the 
method for determining significance of patterns 
discovered by SMPL. 

The SMPL program is based on the general 
premise that the process of inference and pattern 
recognition can be viewed as a search for concise 
encoding of data; for a general argument in support 
of this premise, see ref. 27. Every aspect of the 
analysis of repetitive DNA elements perfectly fits 
this premise. In fact, some of the currently accepted 
Alu subfamilies [28] were discovered by computing 
concise encodings [6, 7]. 

Repetitive patterns can best be defined via encod- 
ing length: a completely random DNA sequence that 
does not resemble any other known sequence 
requires 2 bits per letter; however, if a sequence 
contains long repeated words, then the repeated 
occurrences can be replaced by short pointers to 
earlier occurrences within the same sequence, thus 
reducing the total number of bits. The newly
developed algorithmic significance method [29]
states that d bits of information can be saved by chance with
probability at most 2^-d. In other words, a random sequence is
unlikely to be compressed by chance.
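
The relation between saved bits and significance can be made concrete
with a few lines of Python. This is only an illustrative sketch: the
function names are hypothetical, and the 2-bits-per-letter baseline is
the one stated above for a random DNA sequence.

```python
def saved_bits(sequence_length, encoded_bits):
    """Bits saved relative to the plain encoding at 2 bits per letter."""
    return 2 * sequence_length - encoded_bits

def significance_bound(d):
    """Bound 2**(-d) on the chance of saving d bits in a random sequence."""
    return 2.0 ** (-d)

# Savings reported in the figure legends of this chapter: 8, 21 and 41 bits.
for d in (8, 21, 41):
    print(d, significance_bound(d))
```

Each additional saved bit halves the bound, so a saving of 41 bits
corresponds to a bound below one in a trillion.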

The problem of determining significance of simi- 
larities between objects is best captured using the 
concept of algorithmic mutual information [30], 
which is defined as the difference between the sum 
of individual encoding lengths of objects and their 
joint encoding length. This general formulation 
enables search for sequences of low complexity that 
exhibit enough similarity to prove their homology. 
This kind of analysis may be applied to track 
mutations that abound in sequences of low com- 
plexity. Figure 25.13 contains one example obtained 
by the analysis of the 66495 bp human genomic 
locus that contains growth hormone and chorionic 
somatotropin (HUMGHCSA) genes [31], GenBank 
Accession No. J03071. The approaches that rely on 


‘censoring’ [8, 19] of simple DNA sequence would 
clearly miss these homologies. 

We now turn to the encoding algorithms em- 
ployed by the SMPL program. SMPL and the algo- 
rithmic significance method are described in detail 
in two earlier papers [29,30]; for the sake of com- 
pleteness, some of the material is reviewed here. 

The number of bits needed to encode sequence t
by itself is denoted by I_{A_0}(t). The encoding length of
sequence t relative to sequence s is denoted I_A(t|s).
An encoding of a sequence can in either case be
represented by a parsing, which we describe below.

When encoding a sequence by itself, a repeated 
occurrence of a word is replaced by a pointer to its 
previous occurrence within the same sequence. We 
assume that a pointer consists of two positive 
integers: the first integer indicates the beginning 
position of a previous occurrence of the word while 
the second integer indicates the length of the word. 
For example, sequence
AGTCAGTTTT
may be encoded as
AGTC(1,3)(7,3),
where (1,3) points to the occurrence of AGT from
position 1 to position 3, and (7,3) points to the
occurrence of TTT from position 7 to position 9 in the
original sequence.

The decoding algorithm A_0 consists of the
following two steps:

1 replace each pointer by a sequence of pointers to 
individual letters; 
2 replace the new pointers by their targets in the 
left-to-right order. 

Continuing our example, the first step would
yield
AGTC(1,1)(2,1)(3,1)(7,1)(8,1)(9,1),
and the second step would yield the original
sequence. From this decoding algorithm it should be
obvious that the original sequence can be obtained
despite overlaps of pointers and their targets, as is
the case with the pointer (7,3) in our example.
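
The two decoding steps can be sketched in Python as follows. The
representation of an encoding as a list of letters and 1-based
(start, length) tuples is an assumption made for this illustration, and
the function name is hypothetical.

```python
def decode_self(units):
    """Decode a self-encoding given as letters and (start, length) pointers.

    Positions are 1-based and refer to the sequence decoded so far.
    """
    out = []
    for unit in units:
        if isinstance(unit, str):
            out.extend(unit)          # literal letter(s)
        else:
            start, length = unit
            # Copy one letter at a time, left to right, so that a pointer
            # may overlap its own target, as (7,3) does in the example.
            for offset in range(length):
                out.append(out[start - 1 + offset])
    return "".join(out)

print(decode_self(["A", "G", "T", "C", (1, 3), (7, 3)]))
```

Copying letter by letter corresponds to step 1 (pointers to individual
letters), and processing the units in order corresponds to step 2.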

When encoding a target sequence relative to a 
source sequence, the pointers point to the occur- 
rences of the same words in the source. 

Consider an example where the target sequence is 
GATTACCGATGAGCTAAT 
and the source sequence is 
ATTACATGAGCATAAT 
The occurrences of some words in the target may be 
replaced by pointers indicating the beginning and 
the length of the occurrences of the same words in 
the source, as follows: 

G(1,4)CCG(6,6) (13,4) 

The decoding algorithm A is very simple: it only
needs to replace pointers by words in the source.
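
A minimal Python sketch of relative decoding, using the target and
source of the example above. As before, the list-of-units
representation and the function name are assumptions made for
illustration.

```python
def decode_relative(units, source):
    """Decode a relative encoding: pointers are 1-based (start, length)
    references into the source sequence."""
    parts = []
    for unit in units:
        if isinstance(unit, str):
            parts.append(unit)                      # literal letter(s)
        else:
            start, length = unit
            parts.append(source[start - 1:start - 1 + length])
    return "".join(parts)

source = "ATTACATGAGCATAAT"
encoded = ["G", (1, 4), "CCG", (6, 6), (13, 4)]     # G(1,4)CCG(6,6)(13,4)
print(decode_relative(encoded, source))
```

Unlike self-decoding, no overlap handling is needed because every
pointer refers to the fixed source sequence.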

parsed segment HUMGHCSA 25,801-26,000:

A-G-A-A-AGAAAGAAAGA-GAGAGAGAGAGAGAGAGAGAGA-A-AGAAAGAAAGAAAGAAAGAAAGAAA
GAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA-G-GAAAGAAAGAAAG-GAAA-C-T-A-A-A-A-T-AAC
TAAA-TAACT-G-A-G-T-A-G-C-A-C-C-A-CACCAC-C-T-G-C-T-C-T-G-G-AGAAAGGA-C-T
-T-T-T-G-TTGTTGTTGTTGTTGTTGTTGT-C-GTTGTT1

local alignment of segments HUMGHCSA 25,801-26,000 (top)
and HUMGHCSA 11,201-11,400 (bottom):

GAAAGAAAGAAAGAGAGAGAGAGAGAGAGAGAGAGAAAGAAAGA-AAGAAAGAAAGAAAGAAAGAAAG-A
GAAGGAAGGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAGAGAAAGAAAAAGGAAAGCAAGAAAGAA

AAGAAAGAAAGAAAGAAAGAAAGAAGGAAAGAAAGAAAGGAAACTAAAATAACTAAATAACTGAGTAGCA
AAGAAAGAAAGAAAGAA--AAAGAAAGAAGGAAAGAAAAGAAACTAAAATAACTAAATAACTGAGTAGCA

CCACACCACCTGCTCTGGAGAAAGGACTTTTGTTGTTGTTGTTGTTGTTGTTGTCGTTGT
CCACACCACCTGCTCTGGAGAAAGGACTTTTGTTGTTGTTGTTGTTGTTGTCGTTGTTGT

Fig. 25.13 HUMGHCSA genomic region: parsing of segment 25,801-26,000
and its local alignment with segment 11,201-11,400. This homology
between simple regions was discovered by the SMPL program. Note the
(GA)*N pattern that is present around position 20 in the top segment
but does not occur in the bottom segment. This pattern may have
occurred by duplication of the GA dimer, a frequently occurring
mutation.


In either kind of encoding, one can think of the encoded sequence as
being parsed into words that are replaced by pointers and into the
letters that do not belong to such words. One may then represent the
encoding of a sequence by inserting dashes to indicate the parsing. In
the self-encoding example, the parsing is
A-G-T-C-AGT-TTT
while in the relative-encoding example the parsing is
G-ATTA-C-C-G-ATGAGC-TAAT

Note that there are many possible encodings. We will be particularly
interested in the shortest ones. With every saved bit we improve
significance twofold: the algorithmic significance method states that d
bits can be saved by chance with probability at most 2^-d (refs 29 and
30). Concise self-encoding indicates that a sequence is not random
while concise relative encoding indicates homology.

To apply the algorithmic significance method, we need to count the
number of bits that are needed for a particular encoding. We may assume
that the encoding of a sequence consists of units, each of which
corresponds either to a letter or to a pointer.


Every unit contains a (log 5)-bit field that either indicates a letter
or announces a pointer (the field takes one of five values: A, C, G, T
or 'pointer'). A unit representing a pointer contains two additional
fields with positive integers indicating the position and length of a
word. These two integers do not exceed n, the length of the source
sequence. Thus, a unit can be encoded in log 5 bits in case of a letter
or in log 5 + 2 log n bits in case of a pointer.

If it takes more bits to encode a pointer than to 
encode the word letter by letter, then it does not pay 
off to use the pointer. Thus, the encoding length of a 
pointer determines the minimum length of common 
words that are replaced by pointers in an encoding 
of minimal length. 
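
The unit sizes above, and the resulting minimum useful word length, can
be worked out with a short Python sketch. It assumes that logarithms
are taken to base 2 (so that lengths are in bits); the function names
are hypothetical.

```python
import math

def letter_bits():
    """Bits for a unit that encodes a single letter: log 5."""
    return math.log2(5)

def pointer_bits(n):
    """Bits for a pointer unit: log 5 + 2 log n, n = source length."""
    return math.log2(5) + 2 * math.log2(n)

def encoding_bits(num_letters, num_pointers, n):
    """Total bits of an encoding with the given numbers of units."""
    return num_letters * letter_bits() + num_pointers * pointer_bits(n)

def min_useful_word_length(n):
    """Smallest word length for which a pointer is strictly cheaper
    than spelling the word out letter by letter."""
    return math.floor(pointer_bits(n) / letter_bits()) + 1

for n in (16, 128, 1024):
    print(n, round(pointer_bits(n), 2), min_useful_word_length(n))
```

Under these assumptions a pointer into a 128-letter source costs about
16.3 bits, so only common words of at least 8 letters are worth
replacing.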

Note that we do not need to actually construct 
encodings—it suffices to estimate the encoding 
lengths. Thus, we may assume that we have even 
more powerful decoding algorithms that would 
enable smaller pointer sizes. For further details on 
pointer sizes, see ref. 29. 

The encodings of minimal length can be computed efficiently by a
classical data compression algorithm [32]. We here focus on the
algorithm for encoding one sequence relative to the other. The case
when a sequence is self-encoded requires only a slight modification.

The minimal length encoding algorithm takes as an input a target
sequence t and the encoding length p of a pointer and computes a
minimal length encoding of t for a given source s. Since it is only the
ratio between the pointer length and the encoding length of a letter
that matters, we assume, without loss of generality, that the encoding
length of a letter is 1.

Let n be the length of sequence t and let t_k denote the
(n-k+1)-letter suffix of t that starts in the kth position. Using a
suffix notation, we can write t_1 instead of t. By I_A(t_k|s) we denote
the minimal encoding length of the suffix t_k. Finally, let l(i), where
1 <= i <= n, denote the length of the longest word that starts at the
ith position in target t and that also occurs in the source s. If the
letter at position i does not occur in the source, then l(i) = 0. Using
this notation, we may now state the main recurrence:

I_A(t_i|s) = min(1 + I_A(t_{i+1}|s), p + I_A(t_{i+l(i)}|s))

Proof of this recurrence can be found in ref. 32.

Based on this recurrence, the minimal encoding length can now be
computed in linear time by the following two-step algorithm. In the
first step, the values l(i), 1 <= i <= n, are computed in linear time
by using a directed acyclic word graph data structure that contains the
source s [34]. In the second step, the minimal encoding length
I_A(t|s) = I_A(t_1|s) is computed in linear time in a right-to-left
pass using the recurrence above.
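
The two-step algorithm can be sketched in Python as follows. This is an
illustration under stated assumptions, not the SMPL implementation: the
longest-match lengths l(i) are found here by naive substring search,
which is not linear time, instead of the directed acyclic word graph
used by SMPL; the function names are hypothetical; and the pointer
option is considered only when l(i) > 0.

```python
def longest_match_lengths(target, source):
    """l(i) for each target position: length of the longest word starting
    there that also occurs in the source (naive search, not a DAWG)."""
    n = len(target)
    lengths = []
    for i in range(n):
        k = 0
        while i + k < n and target[i:i + k + 1] in source:
            k += 1
        lengths.append(k)
    return lengths

def min_encoding_length(target, source, p):
    """Minimal encoding length of target relative to source, with a letter
    costing 1 and a pointer costing p (right-to-left pass)."""
    n = len(target)
    l = longest_match_lengths(target, source)
    cost = [0] * (n + 2)              # cost[k] for suffix t_k; cost[n+1] = 0
    for i in range(n, 0, -1):
        best = 1 + cost[i + 1]        # encode the letter at position i
        if l[i - 1] > 0:              # or point to the longest common word
            best = min(best, p + cost[i + l[i - 1]])
        cost[i] = best
    return cost[1]

target = "GATTACCGATGAGCTAAT"
source = "ATTACATGAGCATAAT"
print(min_encoding_length(target, source, p=2))
```

With p = 2 the sketch yields length 9 for the example target, one unit
shorter than the illustrative encoding G(1,4)CCG(6,6)(13,4) given
earlier, because the first pointer can cover ATTAC rather than ATTA.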


Why study the genomes of other organisms? This
question is often asked by clinicians and other 
medical researchers, and carries the implication that 
humans are the only species on which the limited 
funding available for medical research should be 
spent. However, as the recent completion of the 
DNA sequence of the genome of the yeast 
Saccharomyces cerevisiae shows, much information 
can be gained from the study of other organisms that 
will have direct applications to the study of human 
diseases. 

Apart from the fact that some model organisms 
are of commercial importance in their own right, the 
major reasons for analysing the genomes of other 
organisms can be summarized as follows. 
• Gene function can usually be more easily determined in a simpler
model organism, especially where transgenic techniques and controlled
breeding can be implemented, and this may throw light on the function
of the gene in humans.
• The regulation of gene expression can also be more easily studied
through transgenic techniques and induced mutations.
• Comparison of the genomes of different organisms will throw light on
the evolution and conservation of gene function.
• Some organisms can provide models for human disease.

One of the key reasons for studying simpler
model organisms is to determine a basic set of 
eukaryotic functional genes. These can then act as a 
reference set of genes against which those of other 
plants and animals and of humans can be compared. 
Such comparative evolutionary studies will also 
reveal whether gene organization is conserved 
between particular species and, in addition, the role 
and position of introns and other chromosome 
elements can be determined and compared. This 
may give a greater insight into the role of such 
noncoding sequences in the control of transcription 
and gene expression. Finally, the pattern and timing 
of gene expression during development can be more 
easily studied in simpler organisms. 

Chapters 26-34, which make up this section of the 
Handbook, give an insight into the progress of 
genome projects in a wide range of species that are 
either important as laboratory models or of com- 
mercial importance. The model organisms include 
the mouse (Chapter 26, G. Argyropoulos & S.D.M. 
Brown), the fruit fly (Drosophila melanogaster) 
(Chapter 28, R.D.C. Saunders), the nematode worm 
(Caenorhabditis elegans) (Chapter 29, J. Sulston, R. 
Waterston and members of the C. elegans Genome 
Consortium), and the bacterium Escherichia coli 
(Chapter 31, M. Masters & G. Plunkett). Model 
plants are represented by Arabidopsis thaliana 
(Chapter 33, M. Delseny & R. Moore). Organisms of
commercial importance include the brewing and 
baking yeast (Saccharomyces cerevisiae) (Chapter 30, 
H. Feldmann) and rice (Oryza sativa) (Chapter 34, T. 
Sasaki et al.). In addition, Chapter 27 (R. Vile) reviews 
progress in somatic gene therapy in humans. 

There are various reasons for the concentration of 
the genome communities on these organisms. The 
mouse is a mammal like ourselves, with many genes 
homologous to those of humans, and the study of its 
genome has closely paralleled the Human Genome 
Project. There are numerous models of human 
inherited diseases in the mouse, and the ability to 
construct transgenic mice with precisely targeted 
gene knockouts now makes it even more useful. 

The yeast S. cerevisiae is of commercial importance 
and has also been one of the most favoured model 
organisms for studying the regulation and control of 
the cell cycle in eukaryotes. The recently completed 
genome DNA sequence is the first complete genome 
sequence of a eukaryote. The genetics and development of the nematode
worm C. elegans have also been intensively analysed and this simple
multicellular organism will be the first multicellular
eukaryote to have its genome sequence completed
(by 1997). The fruit fly Drosophila has been used as a
genetic model for more than 100 years, and there is a 
great wealth of information available on its genes 
and mutations. These have provided important 
insights into the genetic basis of development of 
multicellular organisms in general. Mutant and 
transgenic flies can readily be generated and techni- 
ques are available for the rapid identification of the 
genes affected. 


The bacterium E. coli was for many years the 
principal experimental organism for most molecular 
biologists. In recent years its dominance in the field 
has lessened as it has become possible to study and 
manipulate eukaryotic genes at the molecular level, 
and interest has shifted from prokaryotes to 
eukaryotes. However, its genome sequence is still of 
great interest because of the wealth of knowledge on 
gene function and expression in E. coli, and for 
comparison with other prokaryote genomes now 
being sequenced. E. coli is also still of interest as a
pathogenic bacterium, as witnessed in Japan in
1996 with the outbreak of food poisoning contracted
from strain O157 and affecting over 6000 schoolchildren.

Chapter 32 (P. Jauhar) outlines the particular 
problems associated with studying plant geno- 
mes —for example, the hybrid origin and polyploidy 
characteristic of many crop species such as wheat. 
This is a valuable insight into an area unfamiliar to 
many genome scientists. Of the two plant species 
discussed in detail in this section, A. thaliana 
(Chapter 33) is a small insignificant weed, but unlike 
many plants it has a small compact genome and has 
been chosen by the plant community as an ideal 
organism for developmental genetics, gene map- 
ping and genome sequencing. It is simple and 
quick to grow and many mutants are available 
for study. Rice, the other plant species represented 
here (Chapter 34), is an important staple crop, 
and has been chosen by Japan and other countries 
as a key organism for study due to its importance 
as a food and as a model for the family of grasses
as a whole, to which many staple crops belong. 


26.1 Introduction


As the Human Genome Project progresses towards 
the determination of the complete DNA sequence of 
the human genome, a parallel project, with similar 
aims, is under way for the mouse genome [1]. The 
role of the mouse as a model organism for studying 
genetic disorders in humans is long established, 
and the creation of detailed genetic and physical 
maps of its chromosomes will facilitate its future 
use. 

A number of so-called model organisms have 
played an important role in the Genome Project. The 
construction of complete genetic and physical maps 
of model organisms presents an unparalleled oppor- 
tunity for comparisons of related genomes that will 
contribute significantly to our knowledge of: 

1 the function of genes and genomes; and
2 the mechanics of evolutionary changes of
genomes.

The mouse, with its short breeding cycle of 
around 8 weeks (gestation time, 21 days; length of 
time to sexual maturity, 4-6 weeks), represents an 
ideal organism for the study of mammalian genetics 
and genome mapping. 

This chapter introduces the mouse as a laboratory 
organism, its capacity as a tool for the study of 
complex mammalian genetic systems, and the role it 
is playing in the dissection of genome organization. 
In Section 26.2 we present a short background to the 
laboratory mouse and its potential as an experi- 
mental model in mammalian genetics. In Sections 
26.3 and 26.4, we examine the strategies used to map 
the mouse genome, providing examples of specific 
genome mapping approaches. 




26.2 The laboratory mouse 


26.2.1 Mutations in the laboratory mouse 


A wide range of mutations have been identified in 
the laboratory mouse including gene defects 
affecting skin texture, coat colour, tail shape and 
length, the skeleton, the eye, the inner ear, neurology 
and neuromusculature as well as genes affecting 
behaviour and reproduction (Table 26.1). Over 100 
mutations have been identified as having some 
neurological or neuromuscular effect and another 
100 mutations have been shown to affect the 
skeleton. In addition, a large number of mutations 
have been identified affecting the function of a 
variety of proteins including enzymes and cell- 
surface antigens. Many mutations that have been 
characterized appear to be monogenic. On the other 
hand, some strains of mice appear to be carrying 
defects that are polygenically determined. Such 
polygenic mice include strains of mice predisposed 
towards diabetes and obesity. The vast array of 
mouse mutations represents a powerful collection of 
animal models for the study of human disease. As 
well as spontaneous and induced mutations, many 
additional mutations have been produced by gene 
knock-out (see Appendix VIII). 


26.3 Strategies for mapping the 
mouse genome 


Table 26.1 Mouse mutants and loci grouped according to a variety of 
phenotypic classes. 

Phenotype 

Colour and white spotting 
Skin and hair texture 
Skeleton 
Tail and other appendages 
Eye 
Inner ear and circling 
Neurological and neuromuscular 
Other behavioural 
Haematological 
Endocrinological, hormonal, growth and obesity 
Reproductive organs, sterility 
Defects of viscera 
Immune defects 
Homozygous lethality or sublethality 

Number of mutant loci 

94 
134 
94 
87 
45 
126 
19 
56 
98 
45 
48 
21 
115 

The number of mutants known in each class is given. Adapted from ref. 14. 


For most mutations the underlying gene carrying 
the altered DNA sequence is not known. The Mouse 
Genome Project aims to provide the resources 
needed to map and identify the mutated gene. The 
overall strategy for the provision of complete 
genetic and physical maps of the mouse genome 
involves a number of elements including: 

1 isolation and characterization of cloned DNA 
fragments—DNA markers—from the mouse 
genome; 

2 establishing suitable genetic crosses with high 
resolution that allow us to complete high-density 
genetic maps of DNA fragments across the entire 
mouse genome; 

3 linking our mapped DNA fragments into a com- 
plete physical map of the mouse genome that gives 
us access to all of its incumbent DNA sequences. 


26.3.1 Genetic mapping of the mouse genome: 
methodologies 


26.3.1.1 DNA markers 

The pivotal feature of DNA markers used in mouse 
genetic mapping is their ability to detect some kind 
of sequence variation between parental strains that 
allows us to follow the segregation of the poly- 
morphic locus in the appropriate genetic cross (see 
below). The methods used for detecting sequence 
variation vary according to the type of marker 
employed. The main approaches used are as 
follows. 

1 Analysis of the segregation of a restriction fragment 
length variant (RFLV) detected between the parental 
strains used in the cross (see Fig. 26.1b). This analysis is 
applicable both to coding sequences and to 
random, nongenic cloned fragments, though in the 
former case it may be more difficult to detect an 
RFLV in a conserved, coding region of the genome. 

2 Analysis of segregation of simple sequence length 
polymorphisms. Mammalian genomes contain fre- 
quent short tandem repeat sequences called micro- 
satellites (refs 2-4, see also Chapter 5). These micro- 
satellites are often composed of a short array of 
dinucleotide repeats. Dinucleotide repeats have 
been shown to vary in length even between closely 
related inbred laboratory strains of mice. Length 
variation in a microsatellite is known as a simple 
sequence length polymorphism (SSLP). Such SSLPs 
can be used as markers to detect the segregation 
pattern of a locus of interest (see Fig. 26.1c). 
Moreover, microsatellite sequences are present at 
frequent intervals in the mouse genome with one 
microsatellite every 20 kb or so, representing a total 
of 150000 microsatellites in the entire mouse 
genome [3]. 

3 Analysis of the segregation of variants of interspersed 
repeat sequence polymerase chain reaction (IRS-PCR) 
products. PCR of genomic DNA using primers to 
interspersed repeat sequences in mammalian 
genomes allows the recovery of PCR products 
corresponding to the sequences between closely 
spaced repeat sequences 
(see Chapters 10 and 14). In the mouse, two major 
short interspersed repeat sequence families have 
been used for the generation of IRS-PCR products — 
the B1 and B2 repeat families, both present in around 
50000-100000 copies per genome [5]. In humans, 
the short interspersed repeat sequence Alu (equi- 
valent to B1 in the mouse) has been used to generate 
IRS-PCR markers in a similar fashion (see Chapters 
9 and 10 for protocols). IRS-PCR product length can 
vary between different strains or species of mice, 
allowing us to score their segregation in genetic 
crosses based upon a length polymorphism. Alter- 
natively, depending upon variation in repeat loca- 
tion between strains or species of mice, presence/ 
absence polymorphisms may be observed that may 
be amenable to segregation analysis depending 
upon the nature of the genetic cross employed (see 
ref. 6 and below). Presence/absence polymorphisms 
are the most common variant observed between 
species. 


26.3.1.2 Interspecific and intersubspecific 
genetic crosses 

The interspecific back-cross [7] has become one of 
the most important genetic tools for the construction 
of genetic maps spanning the mouse genome 
(Fig. 26.1a). A laboratory strain of mouse is crossed to 
a wild species of mouse, Mus spretus. Fertile female 
progeny can be produced which are then back- 
crossed to either the parental laboratory strain or 
the M. spretus strain. Thus, back-cross progeny are 
produced that segregate DNA sequence variation 
derived from the laboratory strain or M. spretus. 
Sometimes a subspecies (M. castaneus) related to 
the laboratory mouse is used, in which case the cross 
is known as intersubspecific [3]. The laboratory 
strain used in the back-cross may be carrying an 
interesting mutation and the back-cross progeny 
produced will also be segregating the mutation of 
interest. Thus, interspecific back-crosses not only 
produce the resources for mapping of DNA markers 
but also may provide the necessary resources for 
mapping and localizing mutations (see below). 


26.3.1.3 Advantages of an interspecific back-cross 

Laboratory strains of mouse and the wild species M. 
spretus separated some 2-3 million years ago. M. 
castaneus and laboratory strains diverged somewhat 
later. However, the DNA sequence of the two 
parental strains in both interspecific and inter- 
subspecific crosses is highly diverged, making it 
relatively easy to identify sequence variation 
between the parental genomes for any DNA marker. 
Over 90% of microsatellites show SSLPs between 
laboratory mice and M. spretus and this percentage is 
not much lower when M. castaneus and other 
laboratory strains are compared [3]. In addition, 
there is a remarkably high rate of variation between 
standard laboratory inbred strains of mice. Around 
7% of microsatellites demonstrate scorable SSLPs 
between inbred strains of mice [2,3]. Thus, it is 
possible to carry out considerable genetic analysis of 
DNA markers even in genetic crosses employing 
standard laboratory strains. When mapping mouse 
mutations, it is important to be aware that the 
penetrance or expressivity of the mutation (see 
Chapter 1) may vary according to the genetic 
background of the mouse strains employed in the 
cross. For some mutations, it may be prudent to 
employ a back-cross strategy that does not involve a 
wild mouse species. 


26.3.1.4 Analysing a mouse back-cross 

DNA from the back-cross progeny is analysed for a 
number of types of sequence variation including 
RFLVs or SSLPs, as described previously. For RFLVs 
and SSLPs, each DNA marker can be defined as a 
sequence-tagged site (STS) since at least a portion of 
its sequence is usually known. The sequence data are 
readily transferred to another laboratory and the 
DNA marker can be reproduced by manufacturing 
primers for each STS and the application of PCR. The 
high DNA sequence variation that characterizes 
the parental strains of interspecific and 
intersubspecific crosses means that it is feasible to analyse 
interspecific back-cross progeny for all available 
DNA markers. The ability to analyse many DNA 
markers in one back-cross in a multipoint fashion 
allows us to order our markers by minimizing the 
number of observed recombination events [8]. This 
is sometimes known as haplotype or pedigree 
analysis (see Chapter 1). 


Fig. 26.1 Genetic mapping using a mouse interspecific 
back-cross. (a) Construction of a genetic map of five 
markers by multipoint analysis on one mouse 
chromosome. Laboratory strain sequence variants are 
shown in bold type and Mus spretus strain sequence 
variants are shown in italics. A number of back-cross 
progeny are recovered derived from recombination 
between the laboratory strain chromosome and the M. 
spretus chromosome in the F1 female. The genetic order of 
markers on the chromosomes is determined by a 
haplotype analysis that minimizes the number of 
observed recombination events across the back-cross 
progeny set. Changing the order of markers would 
necessitate an overall increase in recombination events 
observed and, in addition, the appearance of triple 
recombinants, which are usually very rare. (b, c) 
Diagrammatic representation of the analysis of two 
markers from our interspecific back-cross (loci 2 and 3; 
see part a). Locus 2 (part b) has been analysed using an 
RFLV. DNA from each back-cross progeny is digested 
with the appropriate restriction enzyme, fragments 
separated on an agarose gel and transferred to a nylon 
membrane. The membrane is then hybridized with 
radiolabelled marker 2 and following exposure of the 
membrane to autoradiographic film, the segregation of 
the RFLV can be read. Locus 3 (part c) is analysed by 
following the segregation of an SSLP. The SSLP is 
visualized by amplification of the variant microsatellite 
using radioactively labelled primers and PCR followed 
by running the products on an acrylamide gel. The gel is 
exposed to autoradiographic film, allowing the 
segregation of the SSLP to be read. For locus 2, back-cross 
progeny 4 has not inherited a spretus variant from 
the F1 female but has inherited a spretus variant for locus 
3. Thus, back-cross progeny 4 is recombinant between 
locus 2 and locus 3 (see part a). 
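
The haplotype analysis described above can be illustrated with a minimal sketch. It assumes a hypothetical panel of six back-cross progeny typed at five markers (the names m1 to m5 and all genotypes are invented for illustration), with the allele inherited from the F1 female recorded as L (laboratory strain) or S (M. spretus). The code tries every marker order and keeps the one requiring the fewest recombination events. It shows the principle only; it is not a description of any particular mapping package, and exhaustive search is practical only for a handful of markers.

```python
from itertools import permutations

# Hypothetical marker names and back-cross haplotypes (one string per progeny,
# one character per marker in the order listed in MARKERS).
MARKERS = ("m1", "m2", "m3", "m4", "m5")
RAW = ["LLLLL", "SSSSS", "LLSSS", "SLLLL", "LLLSS", "SSSSL"]
PROGENY = [dict(zip(MARKERS, row)) for row in RAW]


def count_recombinants(order, progeny):
    """Total number of allele switches implied by a given marker order."""
    total = 0
    for haplotype in progeny:
        alleles = [haplotype[m] for m in order]
        total += sum(a != b for a, b in zip(alleles, alleles[1:]))
    return total


def best_order(markers, progeny):
    """Exhaustively search for the order that minimizes recombination events."""
    best = min(permutations(markers),
               key=lambda order: count_recombinants(order, progeny))
    # An order and its reverse are genetically equivalent; report one of them.
    return min(best, best[::-1])


if __name__ == "__main__":
    order = best_order(MARKERS, PROGENY)
    print(order, count_recombinants(order, PROGENY))
```

With these invented data, any other order forces at least one animal to carry two or more crossovers, which is the situation the figure legend describes for a wrongly ordered map.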

The methodology used in the interspecific back- 
cross means that mouse genetic mapping differs 
qualitatively from that usually appropriate to the 
human genome and human genetic mapping (see 
Chapters 1-3). The emphasis is on multipoint 
mapping within a single pedigree, which allows the 
mouse geneticist to construct high-integrity, high- 
resolution and ordered genetic maps of DNA 
markers across most of the mouse genome. 


26.3.1.5 Alternatives to a back-cross analysis 

One alternative mapping resource in the mouse that 
has been extensively used in the past, but less so 
recently, is the set of recombinant inbred (RI) strains [9]. 
RI strains are recovered by mating two inbred 
laboratory progenitor strains to obtain first an F1, 
and then an F2 generation. This is followed by 
rounds of brother-sister matings (Fig. 26.2). A number 
of inbred strains are established that carry one 
or more recombination events between the parental 
mouse chromosomes. Analysis of parental sequence 
variation for DNA markers from each chromosome 
in a number of RI strains establishes a strain distri- 
bution pattern (SDP). Analysis of new DNA markers 
through a variety of RI strains, comparison of 
the SDPs and computation of linkage to already 
mapped markers enables the mouse geneticist to 
arrive at an accurate chromosomal assignment 
(Fig. 26.2). The analysis is formally similar to 
the haplotype or pedigree analysis carried out 
for interspecific and intersubspecific back-crosses. 
However, there are two limitations. First, RI strains 
carry a limited number of recombinants on each 
chromosome and therefore offer limited resolution. 
Second, their derivation from closely related 
laboratory inbred strains hinders the discovery of sequence 
variation for some DNA markers (see above). 
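
The comparison of strain distribution patterns can be sketched as follows. The example assumes four hypothetical RI strains and four previously mapped loci, with each SDP written as a string of A or B (the progenitor strain whose allele each RI strain carries); all names and patterns are invented. A new locus is assigned next to the mapped locus whose SDP it matches most closely. The final function applies the standard relationship for sib-mated RI strains, R = 4r/(1 + 6r), to convert the fraction of discordant strains R into a recombination fraction r; it is included as background and is not taken from this chapter.

```python
# Hypothetical SDPs: one character per RI strain (A or B = progenitor allele).
MAPPED_SDPS = {
    "locus1": "AABA",
    "locus2": "ABBA",
    "locus3": "ABBB",
    "locus4": "BBBB",
}
NEW_LOCUS_SDP = "ABBB"


def sdp_discordance(sdp_a, sdp_b):
    """Number of RI strains in which two loci carry different parental alleles."""
    return sum(a != b for a, b in zip(sdp_a, sdp_b))


def closest_locus(new_sdp, mapped_sdps):
    """Mapped locus whose SDP differs from the new SDP in the fewest strains."""
    return min(mapped_sdps,
               key=lambda locus: sdp_discordance(new_sdp, mapped_sdps[locus]))


def ri_recombination_fraction(discordant_fraction):
    """Invert R = 4r / (1 + 6r) for sib-mated RI strains (valid for R < 0.5)."""
    return discordant_fraction / (4.0 - 6.0 * discordant_fraction)


if __name__ == "__main__":
    print(closest_locus(NEW_LOCUS_SDP, MAPPED_SDPS))
```

With only four strains the assignment is crude, which reflects the limited resolution of RI panels noted above.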


26.3.2 Genetic mapping of the mouse genome: 
current status of the mouse genetic map 


A recent report [4] describes an integrated map of 
over 7000 STSs, including genic probes and micro- 
satellites, covering the entire mouse genome. Nearly 
90% of the markers on this map were readily 
analysable SSLPs. 


26.3.2.1 The microsatellite map of the mouse genome 

Fig. 26.2 The construction and analysis of recombinant 
inbred (RI) strains. The principle behind the mapping of 
new loci in RI strains is illustrated. A new locus is 
assigned by comparing the strain distribution pattern 
(SDP) observed with those of previously mapped loci. 
For the chromosome illustrated, the SDP has been 
determined at four loci (1-4) in four RI strains. For the 
new locus, the SDP was determined in each of the four RI 
strains. The SDP corresponds to that of locus 3, indicating 
that the new locus maps close to locus 3 on this 
chromosome. 


The first microsatellites from the mouse genome 
were developed by Todd and colleagues and used to 
construct a preliminary map of the mouse genome 
using RI strains [2]. Subsequently, a map of 317 
microsatellite markers was developed by Dietrich 
and co-workers [3]. (CA)n microsatellites were 
identified and sequenced from total genomic C57BL/6J 
M13 clone libraries. Each microsatellite was 
analysed through a variety of inbred strains as well as M. 
spretus and M. castaneus DNA. SSLPs were analysed 
through the 46 F2 progeny arising from an intercross 
of an inbred obese mouse strain and M. castaneus. 
This mapping panel provides 92 meioses, giving a 
genetic resolution of 1 crossover per 1.1 cM. In 
addition, the available SSLPs were analysed through 
RI lines to anchor the microsatellite map to the 
known genetic map. 
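
The resolution quoted for the F2 panel follows from simple arithmetic, sketched below. Each F2 animal carries the products of two informative meioses (one from each F1 parent), and 1 cM corresponds to a 1% chance of a crossover per meiosis, so a panel of n meioses is expected to show about one crossover per 100/n cM. The function names are ours and are purely illustrative.

```python
def meioses_in_f2(n_progeny):
    """Each F2 animal is informative for two meioses, one per F1 parent."""
    return 2 * n_progeny


def cm_per_crossover(n_meioses):
    """Average interval (cM) expected to contain one crossover in the panel."""
    return 100.0 / n_meioses


if __name__ == "__main__":
    n = meioses_in_f2(46)
    print(n, round(cm_per_crossover(n), 1))
```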

The Massachusetts Institute of Technology (MIT) 
microsatellite map was constructed using the 
MAPMAKER linkage package. The 317 SSLPs 
mapped initially had an average spacing of 4.3 cM and 
covered 99% of the mouse genome. The latest 
update of the microsatellite map [4] contains 6580 
SSLPs with an average spacing of around 0.2 cM 
between markers. This analysis clearly provides a 
microsatellite map of the mouse genome at high 
density, but of only intermediate resolution. Many 
microsatellites remain unresolved and unordered. 


26.3.2.2 Further integration of the microsatellite map 
with the gene map of the mouse 

Copeland and Jenkins and colleagues have in 
parallel developed a dense gene map of the mouse 
genome, principally by the analysis of cDNA RFLVs 
through an interspecific back-cross [10,11]. Around 
800 gene loci have been ordered across all 20 
chromosomes. Efforts have been made to integrate 
the gene and microsatellite maps by the analysis of 
over 1000 SSLPs through the interspecific back-cross 
used for mapping the genic markers [4]. The result 
provides an integrated map of the mouse genome 
but, nevertheless, again a map of only intermediate 
resolution, where many markers remain unresolved 
and unordered. 


26.3.3 Towards completing the 
mouse genetic map: a high-resolution genetic map 
of the mouse genome 


There are considerable advantages to establishing a 
high-resolution genetic map where the bulk of STSs are 
resolved and ordered. Such a map provides a strong 
basis for the construction of high-integrity physical 
maps of the mouse genome (see below). Interspecific 
back-crosses can provide very fine genetic 
resolution. For example, 1000 back-cross progeny offer 
genetic resolution at the 0.3 cM level with 95% 
confidence. The DNA content of the mouse haploid 
genome is 3 × 10^9 base pairs. Thus, 0.3 cM represents 
~0.5 Mb of DNA. This is a level of resolution 
approachable by physical mapping techniques (see 
Section 26.4). An STS genetic map of the mouse 
genome approaching 0.5 Mb resolution will be 
completed by the end of 1996 and will provide a 
template for the global physical mapping of the 
mouse genome. The intermediate resolution map of 
over 6000 microsatellites that is now complete (see 
ref. 4 and above) is a critical resource from which to 
develop a high-resolution map. 
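
The figures above can be reproduced with a short calculation. If two loci are separated by a recombination fraction d, the chance that none of N back-cross progeny is recombinant between them is (1 - d)^N; setting this to 5% and solving for d gives the distance resolved with 95% confidence. The conversion to physical distance below assumes a total genetic length of about 1600 cM for the mouse genome, which is our working assumption and not a figure given in this chapter, so the result should be read only as agreeing in order of magnitude with the ~0.5 Mb quoted.

```python
def resolution_cm(n_progeny, confidence=0.95):
    """Distance (cM) at which at least one recombinant is expected among
    n_progeny back-cross animals with the stated confidence."""
    d = 1.0 - (1.0 - confidence) ** (1.0 / n_progeny)
    return 100.0 * d


def cm_to_mb(cm, genome_bp=3e9, genome_cm=1600.0):
    """Convert cM to Mb using an average genome-wide ratio.
    genome_cm is an assumed total genetic length, not a value from the text."""
    return cm * (genome_bp / genome_cm) / 1e6


if __name__ == "__main__":
    r = resolution_cm(1000)
    print(round(r, 2), round(cm_to_mb(r), 2))
```

The ratio of physical to genetic distance varies along the chromosomes, so any single conversion factor is an average.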

A collaborative programme is under way in 
Europe to develop a high-resolution microsatellite 
map of the entire mouse genome [8]. This 
programme, the European Collaborative Interspecific 
Back-cross (EUCIB) programme, aims to complete 
a microsatellite map of the mouse to 0.3 cM resolution. 
Nine hundred and eighty-two interspecific 
back-cross progeny have been generated, of which 501 
mice were from F1 females back-crossed to M. 
spretus and 481 mice were from F1 females 
back-crossed to C57BL/6. All 982 back-cross progeny 
were analysed for over 70 anchor loci distributed 
across all chromosomes. Completion of the anchor 
map has allowed the identification of pools of 
animals recombinant in individual chromosome 
regions and enables the high-resolution mapping of 
new markers in a chromosome region by their 
analysis through a limited collection of animals. Each 
SSLP that has already been mapped to intermediate 
resolution and to a particular chromosome region is 
analysed through the relevant recombinants in the 
EUCIB back-cross panel, thus developing a 
high-resolution, ordered microsatellite map of the mouse. 
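
The selection of recombinant pools can be sketched in a few lines. The example assumes a hypothetical table of anchor genotypes (animal identifiers, anchor names A1 to A3 and alleles are all invented), with each allele from the F1 female scored as L or S. An animal belongs to the pool for an interval if its alleles at the two flanking anchors differ. This is an illustration of the idea and makes no claim about how the EUCIB programme implements it.

```python
# Hypothetical anchor genotypes: animal -> {anchor locus: allele from F1 female}.
ANCHOR_GENOTYPES = {
    "bc001": {"A1": "L", "A2": "L", "A3": "L"},
    "bc002": {"A1": "L", "A2": "S", "A3": "S"},
    "bc003": {"A1": "S", "A2": "S", "A3": "L"},
    "bc004": {"A1": "S", "A2": "S", "A3": "S"},
}


def recombinants_in_interval(genotypes, left_anchor, right_anchor):
    """Animals whose alleles differ at the two anchors flanking an interval."""
    return sorted(animal for animal, alleles in genotypes.items()
                  if alleles[left_anchor] != alleles[right_anchor])


if __name__ == "__main__":
    print(recombinants_in_interval(ANCHOR_GENOTYPES, "A1", "A2"))
    print(recombinants_in_interval(ANCHOR_GENOTYPES, "A2", "A3"))
```

Only the animals returned for an interval need to be typed for a new marker already known to lie in that interval. Animals carrying a double crossover between two anchors would be missed by this test, although such events are rare over short intervals.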

As half of the EUCIB back-cross was generated by 
back-crossing to M. spretus, it is possible to use the 
EUCIB resource for the rapid, high-resolution 
genetic mapping of IRS-PCR products [6]. Many of 
the IRS-PCR products generated in laboratory 
strains of mice do not detect a product when 
hybridized to IRS-PCR products of M. spretus. In M. 
spretus the arrangement of repeat sequences can be 
different, thus leading to a presence/absence 
variation between the species. This allows for a 
relatively rapid system for scoring the segregation of 
IRS-PCR products through back-cross progeny. IRS- 
PCR products are hybridized to gridded arrays of 
IRS-PCR products from the progeny derived by 
back-crossing to M. spretus. Presence or absence of 
signal for each back-cross progeny is rapidly scored 
and as with microsatellite or genic markers, linkage 
and haplotype ordering can be computed. IRS-PCR 
markers are set to make an increasing contribution 
to the high-resolution maps of the mouse genome. 

The EUCIB programme is supported by the MBx 
database [8]. The MBx database was constructed 
using the distributed client/server database Sybase, 
Sybase application development tools (APT), and 
the C and XView programming languages. MBx 
stores all locus, probe and SSLP data. Allele data at 
each locus is presented as a scrollable matrix on 
screen. MBx can compute genetic linkage between 
loci and, in addition, can automatically perform the 
necessary haplotype analysis by minimizing the 
number of observed recombinants across any 
chromosome region in order to derive genetic order 
of loci. Furthermore, MBx can abstract all mice 
carrying recombinants in any chromosome region. 
Such information when downloaded to robots is 
important for the automated and error-free selection 
of the correct DNAs for subsequent high-resolution 
microsatellite mapping in a particular chromosome 
region. Map information is displayed through a 
modified front-end version of the ACeDB developed 
for Caenorhabditis elegans (see Chapter 29). 

In Fig. 26.3, the latest high-resolution microsatellite 
map for the proximal mouse X chromosome (A. 
Haynes, N. Quaderi & S.D.M. Brown, unpublished 
data) is presented in ACeDB format. The ACeDB 
display is a multimap format demonstrating both 
the MIT microsatellite map at low resolution and the 
high-resolution map completed on the EUCIB 
back-cross. The increase in resolution is immediately 
apparent. 


Fig. 26.3 High-resolution 
microsatellite map of the mouse 
X chromosome. An ACeDB 
database MultiMap display is 
shown for microsatellite maps of 
the mouse X chromosome. On 
the left is shown the low-resolution 
MIT map (MIT.g.Chr 
X) (see text). On the right is the 
high-resolution microsatellite 
map generated on the EUCIB 
back-cross (MBX.g.Chr.X). 
ACeDB Multimap displays 
connecting lines between 
identical loci mapped in 
different crosses, allowing 
comparisons both of gene order and of 
resolution. The higher resolution 
of the EUCIB map is apparent. 


26.3.4 Accessing mouse genetic map information 


26.3.4.1 The Mouse Genome Database and the 
Encyclopedia of the Mouse Genome 

Mouse genetic mapping information from centre 
programs, collaborative programs and single 
laboratory efforts worldwide is regularly transferred to 
the Mouse Genome Database (MGD) at the Jackson 
Laboratory (Bar Harbor, ME, USA) and presented in 
the latest issue of the Encyclopedia of the Mouse 
Genome, a tool for the presentation of mouse 
genome and related information. MGD contains 
mouse locus information; genetic mapping data; 
mammalian homology data; probes, clones and PCR 
primers; genetic polymorphisms; the Mouse Locus 
Catalogue (gene descriptions) and characteristics of 
inbred strains. MGD is available over the World 
Wide Web as are various other services offered by 
the Jackson Lab. Information can be found on the 
MGD WWW home page at http://www.informatics.jax.org. 


26.3.4.2 Updates on the Whitehead/MIT Centre for 
Genome Research microsatellite maps 

Information on the latest releases of the MIT 
microsatellite map can be obtained via the World 
Wide Web: http://www-genome.wi.mit.edu. 


26.3.4.3 The MBx database: accessing the 
high-resolution microsatellite map 

The latest EUCIB maps and mapping data held on 
the MBx database are available via the World Wide Web: 
http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage. 


26.3.4.4 Other sources of mouse genetic map 
information 

There are a number of other routes to access current 
mouse genetic map information. The genetic maps 
are collated, examined and updated yearly by 
chromosome committees. Yearly chromosome committee 
reports that include the current genetic maps are 
published in the journal Mammalian Genome (e.g. 
ref. 12). These reports form much of the basis for 
updates of MGD. 

Mouse Genome [13] is a specialist mouse genetics 
publication with four issues per year that contains 
updates on gene names, chromosomal localization 
of genes and new loci, updates on nomenclature as 
well as short communications on genetic and 
physical maps. The first issue each year deals with 
linkage maps and maps of chromosomal anomalies. 
The second issue contains listings of gene symbols 
and chromosome anomalies. The third issue alter- 
nates between (a) information on the history and 
location of inbred strains, congenic strains and 
recombinant inbred strains and (b) lists of DNA 
clones and probes. The fourth issue has listings of 
RFLPs. Information about subscribing to Mouse 
Genome can be obtained via e-mail: j.peters@har.mrc.ac.uk. 

Finally, at regular intervals, the publication 
Genetic Variants and Strains of the Laboratory Mouse 
[14] is updated and reissued. This contains extensive 
listings and discussions of all that is valuable for 
the mouse geneticist, including the mouse locus 
catalogue, linkage homologies between mouse and 
human, nomenclature rules, chromosomal anoma- 
lies, inbred strain listings amongst others. The third 
edition has just been published. 


26.3.4.5 Nomenclature in the mouse 

Refer to recent editions of Mouse Genome [13] for the 
latest revisions to the extensive rules that govern 
nomenclature in mouse genetics. 


26.3.5 Using the mouse genetic map 


A high-resolution, ordered genetic map of the 
mouse genome incorporating highly variable DNA 
markers has a number of important uses. 


26.3.5.1 Detailed genetic mapping of mouse mutations 
as a prelude to positional cloning 

The current genetic maps provide the necessary 
resources for the cloning of genes associated with 
the plethora of interesting mutations available in the 
mouse. To access the gene underlying a mutation 
it is necessary to carry out a specific cross that 
segregates the mutation of interest. An interspecific 
or intersubspecific back-cross carrying the mutation 
is produced as described in Section 26.3.1—often 
with 1000 or more back-cross progeny — allowing a 
high-resolution genetic analysis of the mutation’s 
position. Back-cross progeny are analysed not only 
for the segregation of the mutation but also for 
sequence variation of DNA markers from the 
vicinity of the mutation. The genetic analysis allows 
us to determine our most closely flanking DNA 
markers, an important first stage in localizing and 
ultimately accessing the mutation through further 
physical mapping (see Section 26.4). In an 
interspecific back-cross of 1000 or more progeny, an STS 
nonrecombinant with the mutation would lie within 
0.5 Mb of the mutated gene. This is the first stage in a 
positional cloning strategy based upon the powerful 
use of the high-resolution genetics available in the 
mouse. 
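
The identification of nonrecombinant and flanking markers can be sketched as follows. The example assumes four hypothetical markers (D1 to D4, already ordered along the chromosome) and five invented back-cross progeny; for each animal the allele inherited from the F1 female is scored as L or S at every marker, and the mutation itself is scored in the same way from the phenotype. Markers showing no recombinants with the mutation define the candidate interval, and the nearest recombinant marker on each side bounds it.

```python
# Hypothetical ordered markers and back-cross data.
ORDERED_MARKERS = ["D1", "D2", "D3", "D4"]
PROGENY = [
    {"mutation": "L", "alleles": {"D1": "L", "D2": "L", "D3": "L", "D4": "L"}},
    {"mutation": "S", "alleles": {"D1": "S", "D2": "S", "D3": "S", "D4": "S"}},
    {"mutation": "L", "alleles": {"D1": "S", "D2": "L", "D3": "L", "D4": "L"}},
    {"mutation": "S", "alleles": {"D1": "S", "D2": "S", "D3": "S", "D4": "L"}},
    {"mutation": "L", "alleles": {"D1": "S", "D2": "S", "D3": "L", "D4": "L"}},
]


def recombinants_with_mutation(marker, progeny):
    """Number of animals recombinant between a marker and the mutation."""
    return sum(p["alleles"][marker] != p["mutation"] for p in progeny)


def nonrecombinant_interval(ordered_markers, progeny):
    """Return (left flank, nonrecombinant markers, right flank), or None if
    every marker shows at least one recombinant with the mutation."""
    counts = [recombinants_with_mutation(m, progeny) for m in ordered_markers]
    zero = [i for i, c in enumerate(counts) if c == 0]
    if not zero:
        return None
    left = ordered_markers[zero[0] - 1] if zero[0] > 0 else None
    last = zero[-1]
    right = ordered_markers[last + 1] if last + 1 < len(ordered_markers) else None
    return left, [ordered_markers[i] for i in zero], right


if __name__ == "__main__":
    print(nonrecombinant_interval(ORDERED_MARKERS, PROGENY))
```

The flanking markers returned here are the natural starting points for the physical mapping described in Section 26.4.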


26.3.5.2 Identification of candidate genes for 
mouse mutations 

Once a new gene has been added to the genetic map, 
its position can be compared with the map position 
of known mutations. Depending upon what is 
known of the function of the gene and the 
phenotype of the mutation, the newly mapped gene 
may be a plausible candidate for the site of the 
mutation and this can be tested by direct sequence 
analysis. For example, the glycine receptor 
β-subunit gene (Glrb) maps to mouse chromosome 
3 in the vicinity of the spastic mutation (spa). It was 
found that Glrb mRNA was reduced throughout 
brains of spa mice, apparently as the result of a 
LINE-1 element insertion in intron 6 of the Glrb gene [15]. 


26.3.5.3 Genetic mapping of polygenic loci 

The rapid expansion in the density of microsatellite 
markers on the mouse genetic map has been an 
important advance in helping investigators to 
identify the map position of loci involved in 
polygenic traits in the mouse. It is important to 
emphasize that the mouse is a very important model 
organism for the analysis of traits that are deter- 
mined polygenically. The analysis of polygenic traits 
is often more straightforward in the mouse or 
other laboratory organisms because defined genetic 
strains are available. For example, a back-cross 
between the non-obese diabetic (NOD) mouse that is 
predisposed to Type I insulin-dependent diabetes 
and normal laboratory strains provides back-cross 
progeny segregating the diabetes phenotype. Analy- 
sis of the back-cross progeny with microsatellites en- 
compassing the whole of the mouse genome allows 
the investigator to identify linkage association 
between individual microsatellites and susceptibility 
loci. In the mouse, a number of susceptibility loci on 
a number of chromosomes have been mapped that 
appear to predispose to Type I diabetes [16]. 
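
A simple test for linkage between a marker and a susceptibility locus in such a back-cross can be sketched as follows. Among affected back-cross progeny, a marker unlinked to any susceptibility locus is expected to be homozygous and heterozygous in equal numbers, and a departure from this 1:1 ratio can be assessed with a chi-square test with one degree of freedom. The marker names and counts below are invented. The threshold of 3.84 is the usual 5% point for a single test; a scan of the whole genome involves many markers and calls for a more stringent threshold, which this sketch does not attempt to set.

```python
# Hypothetical counts among affected back-cross progeny:
# marker -> (number homozygous, number heterozygous).
AFFECTED_COUNTS = {
    "marker_a": (30, 10),
    "marker_b": (21, 19),
    "marker_c": (20, 20),
}

CHI_SQUARE_5_PERCENT_1DF = 3.84  # single-test threshold, one degree of freedom


def backcross_chi_square(n_homozygous, n_heterozygous):
    """Chi-square statistic for departure from the 1:1 ratio expected
    at an unlinked marker in a back-cross."""
    expected = (n_homozygous + n_heterozygous) / 2.0
    return ((n_homozygous - expected) ** 2
            + (n_heterozygous - expected) ** 2) / expected


def candidate_markers(counts, threshold=CHI_SQUARE_5_PERCENT_1DF):
    """Markers whose genotype ratio among affected animals exceeds the threshold."""
    return sorted(marker for marker, (hom, het) in counts.items()
                  if backcross_chi_square(hom, het) > threshold)


if __name__ == "__main__":
    print(candidate_markers(AFFECTED_COUNTS))
```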


26.3.5.4 Comparative genetic maps of the 
mouse and human genomes 

Detailed mouse genetic maps are a powerful 
resource for comparison with the genome maps of 
other organisms, especially human. When the 
genetic maps of mouse and human are compared, it 
is found that there are many areas in the two 
genomes where gene content and gene order are 
conserved [17,18]. These regions are called 
served ordered segments. Given the density of genes 
mapped in common between mouse and human, it 
is believed that the bulk of conserved ordered 
segments between the two organisms have already 
been identified. 

It is worth discussing three examples where 
conserved ordered segments between mouse and 
human can be of significant use. 

1 Identification of putative homologous mutations 
between the mouse and human genomes. For example, 
human hereditary hyperekplexia, or startle disease 
(STHE), maps to distal chromosome 5q, and point 
mutations in exon 6 of GLRA1, the gene encoding 
the α1 subunit of the glycine receptor (mouse 
homologue Glra1), have been identified in patients 
from STHE families. STHE shows a similar 
phenotype to the mouse mutation spasmodic (spd) which 
maps on mouse chromosome 11 in a conserved 
linkage group with human 5q. The homology of 
spd and hyperekplexia was confirmed by the 
identification of mutations in the Glra1 gene in spd 
mice [19]. 

2 Identification of genes in the mouse genome that may be 
candidates for mutations in the human genome. For 
example, in the mouse, the gene underlying the 
shaker-1 mouse deafness mutation was recently 
identified by a positional cloning route [20,21]. The 
shaker-1 gene encodes an unconventional myosin 
molecule, myosin VII. This gene represented an 
attractive candidate for one form of the deaf-blind 
syndrome, Usher type 1b, in humans, which maps to 
human chromosome 11q13.5. The shaker-1 mouse 
mutant was known to map in a region of mouse 
chromosome 7 that forms part of a conserved 
ordered segment encompassing 11q13.5 in humans 
[22]. Indeed, it was shown that the myosin VII gene 
underlies Usher syndrome type 1b and a number of 
mutations have been identified in the myosin VII 
gene in affected families [23]. 

3 Identification of genes mapped in the human genome 
that are potential candidates for mutations in the mouse 
genome. One recent example is the identification of a 
gene on the mouse X chromosome that carries the 
xid (X-linked immunodeficiency) mutation that 
causes defects in B-cell development. In humans, the 
X-linked agammaglobulinaemia mutation (XLA), 
which is also associated with disorders in B-cell 
development, lies in the same conserved linkage 
group as the mouse xid mutation. The gene affected 
in XLA encodes a cytoplasmic tyrosine kinase— 
Bruton’s tyrosine kinase, Btk [24]. The mouse Btk 
gene maps close to the xid mutation on the mouse X 
chromosome and it has been shown that in xid mice 
the Btk gene carries a point mutation [25,26]. 


26.4 Physical mapping of the 
mouse genome 


Genetic mapping provides an ordered array of STSs 
along the chromosome. A physical map is an 
ordered, overlapping array of DNA clones covering 
an entire chromosome or chromosome region. An 
overlapping set of DNA clones is known as a contig. 
The genetic map provides a framework upon which 
the physical map can be built by linking adjacent 
STSs on the genetic map into a set of overlapping 
DNA clones. 
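
The way shared STSs link clones into a contig can be sketched in a few lines. The example assumes four hypothetical clones (yac1 to yac4) and five STSs (A to E), with invented STS content for each clone. Clones are grouped into the same contig whenever they share at least one STS, directly or through a chain of intermediate clones. This is a minimal illustration; it ignores chimaeric or rearranged clones, which would need to be detected and excluded in practice.

```python
# Hypothetical STS content of four clones.
CLONE_CONTENT = {
    "yac1": {"A", "B"},
    "yac2": {"B", "C"},
    "yac3": {"C", "D"},
    "yac4": {"D", "E"},
}


def assemble_contigs(clone_content):
    """Group clones that share STSs into contigs.
    Returns a list of (sorted clone names, sorted STS names) tuples."""
    contigs = []  # each entry: (set of clones, set of STSs)
    for clone, sts_set in clone_content.items():
        merged_clones, merged_sts = {clone}, set(sts_set)
        untouched = []
        for clones, stss in contigs:
            if stss & sts_set:
                # The new clone overlaps this contig: merge them.
                merged_clones |= clones
                merged_sts |= stss
            else:
                untouched.append((clones, stss))
        contigs = untouched + [(merged_clones, merged_sts)]
    return sorted((sorted(c), sorted(s)) for c, s in contigs)


if __name__ == "__main__":
    for clones, stss in assemble_contigs(CLONE_CONTENT):
        print(clones, stss)
```

Removing a bridging clone from the input (yac3 in this invented example) splits the result into two contigs, which corresponds to a gap in the physical map that further library screening would need to close.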


26.4.1 Converting the genetic map into a 
physical map 


26.4.1.1 Clone resources for physical mapping 

Yeast artificial chromosome (YAC) clones [27] have 
played a pivotal role in the construction of physical 
maps of mammalian genomes. YAC clones can 
contain large inserts up to 1Mb or more, thus 
allowing considerable coverage of any chromosome 
region with a small number of clones. The use of 
YAC clones for coverage of large or entire 
chromosome regions is most advanced in human genomics 
studies [28] (see Chapter 15). There are, however, 
two drawbacks to YAC libraries. First, most YAC 
libraries, including mouse YAC libraries, 
demonstrate a high frequency of chimaeric clones, that is, clones 
carrying insert material from noncontiguous 
regions of the genome [29]. Second, some clones are 
markedly unstable, demonstrating deletions or other 
rearrangements of material. Nevertheless, these 
problems are outweighed by the advantages of very 
large insert size. Five mouse YAC libraries are 
currently available (Table 26.2). One of these 
libraries, the St Mary’s library, was constructed in a 
recombination-deficient (rad52) strain of yeast (the 
first such YAC library to be constructed) in order to 
lower chimaerism rates and improve stability of 
YAC clones [30]. 

A number of other vector systems that have the 
capacity for large insert size have been developed, 
including P1 [31], PACs [32] and bacterial artificial 
chromosomes (BACs) [33]. The P1 cloning system 
can accommodate inserts up to around 100kb (see 
Chapter 15), while PACs and BACs have a larger 
capacity, of around 130-150 kb and >300kb, respec- 
tively. These cloning systems, although unable to 
accommodate the very large insert sizes of YAC 
clones, have considerable advantages in terms of 
clone stability and low chimaerism rates within 
libraries. Mouse P1 [34] and BAC libraries are 
available. 
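
The trade-off between insert size and library size can be put in rough numbers. The sketch below uses the standard Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is the insert size as a fraction of the genome and P is the desired probability that any given sequence is represented. The genome size of 3000 Mb is an assumed round figure that does not come from the text, and the insert sizes are single representative values taken from the ranges quoted above (the midpoint for PACs, the lower bound for BACs, the upper figure for YACs).

```python
import math


def clones_for_coverage(genome_kb, insert_kb, probability):
    """Clarke-Carbon estimate of the number of clones needed so that a given
    sequence is present in the library with the stated probability."""
    fraction = insert_kb / genome_kb
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


# Assumption: haploid mouse genome of roughly 3000 Mb (not stated in the text).
GENOME_KB = 3_000_000

# Representative insert sizes in kb, chosen from the ranges quoted in the text.
INSERT_KB = {"YAC": 1000, "BAC": 300, "PAC": 140, "P1": 100}

estimates = {
    vector: clones_for_coverage(GENOME_KB, insert, 0.95)
    for vector, insert in INSERT_KB.items()
}

for vector, n_clones in estimates.items():
    one_fold = math.ceil(GENOME_KB / INSERT_KB[vector])
    print(f"{vector}: {one_fold} clones per genome equivalent, "
          f"{n_clones} clones for 95% representation")
```

Under these assumptions a YAC library needs roughly a tenth as many clones as a P1 library for the same representation, which is the practical meaning of the insert-size advantage described above.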


26.4.1.2 Construction of physical maps by STS content 

Screening of mouse YAC libraries using PCR or 
filter-based methods allows us to identify YAC 
clones covering adjacent STSs on the genetic map 
(Fig. 26.4). YAC clones carrying an adjacent series of 
STSs can be linked into an overlapping series of 
clones called a YAC contig, a procedure often 
referred to as STS content mapping. It can be expected 


Table 26.2 Available mouse YAC libraries. 
Fig. 26.4 STS YAC contigging to create and analyse 
physical maps across the mouse genome. Five STSs (A-E) 
which have been shown by genetic mapping to lie close 
to each other on a chromosome are used to screen for 
overlying YAC clones. STSs shared by the four YACs 
allow the assembly of a YAC contig (STS YAC content 
mapping). An overlapping series of YAC clones that 
provides access to all the DNA sequences is generated in 
that region. In addition, the STSs may be analysed in an 
interspecific back-cross segregating for a mutation in that 
region. Some STSs (A, B and E) may be recombinant with 
the mutation in back-cross progeny and represent STSs 
flanking the region of the mutation; other STSs may be 
non-recombinant (C and D). The genetic analysis defines 
a region on the overlying YAC physical map where the 
mutation is likely to lie. The YAC contig in this region 
provides the necessary cloned sequences to test for and 
identify the relevant gene sequences carrying the 
mutation. 


that between 20000 and 30000 STSs might be 
needed to establish a robust physical map of the 
mouse genome with almost complete coverage. 
PCR screening is the most rapid and often the 
most reliable method of identifying clones from a 
YAC library. But it requires a highly organized 
scheme for pooling clones to identify the relevant 
YACs from the large number of clones in a YAC 
library. For a programme of work directed towards 
the construction of a complete YAC physical map of 
the mouse X chromosome, we have rearrayed two 
mouse YAC libraries, the St Mary’s and ICRF 
libraries (see Table 26.2), into a three-dimensional 
(3-D) format for rapid PCR screening. The combined 
library (the 3-D library) represents seven genome 
equivalents with an overall average insert size of 
500kb and can be screened by two rounds of PCR 
screening to identify the relevant YAC coordinate. 
Briefly, microtitre plates, each carrying 96 (12x8) 
clones, are arranged into 3-D stacks with each stack 
containing 72 microtitre plates: 6 plates in each floor 
and 12 floors in each stack (Fig. 26.5). The 3-D library 
contains eight stacks in total. DNA pools are 
prepared from all the clones in each of the 12 floors, 
24 rows and 24 columns in each stack (Fig. 26.5). In 
addition, DNA pools—superpools—representing 
all of the clones in each stack are prepared. In a first 
round of PCR screening using the superpools, those 
stacks containing positive clones are identified. This 
is followed by a second round of PCR screening of 
the relevant floor, column and row pools from the 
positive stacks, providing a unique identifier for each 
positive YAC clone in the library. 
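
The two-round screening scheme just described can be sketched in a few lines. This is an illustration only: the stack, floor, row and column counts come from the text, but the coordinate numbering, the function names and the example clones are hypothetical, and the simulation assumes error-free PCR.

```python
from itertools import product

# Dimensions of the 3-D library as described in the text.
STACKS, FLOORS, ROWS, COLUMNS = 8, 12, 24, 24
POOLS_PER_STACK = FLOORS + ROWS + COLUMNS        # 12 + 24 + 24 = 60
TOTAL_CLONES = STACKS * FLOORS * ROWS * COLUMNS  # 8 stacks x 72 plates x 96 wells


def screen(positive_clones):
    """Simulate both PCR rounds for clones given as (stack, floor, row, column).

    The keys of the result are the positive superpools (stacks); the sets are
    the positive floor, row and column pools within each positive stack.
    """
    results = {}
    for stack, floor, row, column in positive_clones:
        pools = results.setdefault(
            stack, {"floor": set(), "row": set(), "column": set()}
        )
        pools["floor"].add(floor)
        pools["row"].add(row)
        pools["column"].add(column)
    return results


def decode(results):
    """List every clone coordinate consistent with the positive pools."""
    candidates = []
    for stack in sorted(results):
        pools = results[stack]
        for floor, row, column in product(
            sorted(pools["floor"]), sorted(pools["row"]), sorted(pools["column"])
        ):
            candidates.append((stack, floor, row, column))
    return candidates


def pcr_count(results):
    """Reactions used: one per superpool, then 60 per positive stack."""
    return STACKS + POOLS_PER_STACK * len(results)


single = decode(screen({(3, 5, 17, 2)}))
double = decode(screen({(3, 5, 17, 2), (3, 9, 4, 20)}))
print(TOTAL_CLONES, single, len(double))
```

With one positive clone in a stack, the three positive pools give a unique coordinate after 68 reactions rather than 55 296. When two positives fall in the same stack, the pools are consistent with up to eight coordinates, so candidate clones would then need to be checked individually.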


[Fig. 26.5 labels: St Mary’s rad52 library, 3.5 genome equivalents; ICRF library, 3.5 genome equivalents; 12 floors; 1-24 columns; 1-24 rows.] 


Fig. 26.5 Arraying mouse YAC libraries for rapid PCR 
screening. A schematic representation of the mouse 3-D 
yeast artificial chromosome (YAC) library that was 
prepared by combining the clones of the St Mary’s and 
ICRF YAC libraries (see Table 26.2). Each of eight stacks 
in the library contains 72 microtitre plates, each plate 
holding 96 (8 x 12) clones. Each stack consists of 12 floors, 
with 6 plates in each floor. Combined clones from each 
floor, column and row in each stack (a total of 60 pools, that 
is, 12 + 24 + 24) were used to prepare DNA. In 
addition, DNA from the 60 pools was also combined to 
prepare a superpool for each stack. The library can be 
rapidly screened by PCR by first screening the 
superpools to identify positives in stacks followed by 
screening of the relevant set of 60 floor/rows/columns 
pools to identify a unique coordinate. 


The alternative to PCR screening is hybridization 
screening using available STSs. High-density 
gridded filters of YAC clones can be used for 
hybridization screening of mouse YAC libraries [35]. 
High-density filters can be screened with available 
probes representing STSs from any region. It is not, 
however, possible to screen YAC libraries with 
microsatellite markers using hybridization tech- 
niques. As discussed below, YAC libraries can also 
be screened with IRS-PCR markers using filter 
hybridization techniques. 


26.4.1.3 Construction of anchored YAC framework maps 
of mouse chromosomes 

The first stage of our programme directed towards 
the coverage of the mouse X chromosome with YAC 
contigs is the establishment of an anchored YAC 
framework map. This involves screening the 3-D 
library with all available genetically mapped STSs, 
including the use of microsatellite primers and 
primers developed from known X chromosome 
gene sequences. To date, this work has allowed us to 
develop an anchored YAC framework map of the 
mouse X covering an estimated 50% of the chro- 
mosome: over 370 YAC coordinates have been 
identified to 139 STSs (N. Quaderi, G. Argyropoulos, 
A. Haynes & S.D.M. Brown, unpublished data). 
From the available data, 18 contigs have already 
been identified by common STS content. 
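
Grouping YACs into contigs by common STS content amounts to finding connected sets of clones, where two clones are connected if they share an STS. The sketch below shows one way this could be done with a union-find structure. It is not the method used in the project, and the STS and YAC names are hypothetical.

```python
def build_contigs(sts_hits):
    """Group YACs that are linked, directly or indirectly, by shared STSs.

    sts_hits maps each STS name to the list of YAC clones positive for it.
    Returns a sorted list of contigs, each a sorted list of YAC names.
    """
    parent = {}

    def find(yac):
        # Follow parent links to the group representative, compressing the path.
        while parent[yac] != yac:
            parent[yac] = parent[parent[yac]]
            yac = parent[yac]
        return yac

    for yacs in sts_hits.values():
        for yac in yacs:
            parent.setdefault(yac, yac)
        # All YACs positive for the same STS belong to the same group.
        for yac in yacs[1:]:
            root_a, root_b = find(yacs[0]), find(yac)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for yac in parent:
        groups.setdefault(find(yac), set()).add(yac)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical screening results: STS name -> positive YAC clones.
hits = {
    "A": ["Y1"],
    "B": ["Y1", "Y2"],
    "C": ["Y2", "Y3"],
    "D": ["Y3", "Y4"],
    "E": ["Y4"],
    "F": ["Y9"],
}
contigs = build_contigs(hits)
print(contigs)
```

A chimaeric YAC that carries STSs from two unrelated regions would join two groups that do not belong together, which is one reason why the order of the STSs on the genetic map is needed to check each proposed contig.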


26.4.1.4 Databases to hold physical mapping data 

The MBx database has been modified and expanded 
to carry physical mapping data as well as genetic 
mapping data. For each STS, the relevant YAC 
coordinate information is stored, along with details 
of clone size, chimaerism, etc. Physical maps are 
generated by the use of an algorithm, SAM2 [36]. 
This algorithm enables the user to automatically 
generate and subsequently modify, if required, 
contig tiling paths in any one chromosome region. 
Following assessment and manipulation of contig 
information in SAM2, contig information is then 
displayed on the Web (see MBx WWW address 
above). Both multimaps relating physical map infor- 
mation to genetic maps as well as more detailed 
displays of the STS content maps are available 
(Fig. 26.6). 


26.4.1.5 High-resolution genetic maps as an aid to 
physical mapping and contig closure 

The integrity of the growing YAC contigs is greatly 
assisted by the development in parallel of the high- 
resolution microsatellite map of the mouse X chro- 
mosome (see above). The high-resolution genetic 
map resolves and orders STSs to very fine resolution 
Fig. 26.6 The anchored YAC framework map of the 
mouse X chromosome. A display available through the 
WWW (http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html) 
of a portion of the anchored YAC framework 
map of the mouse X chromosome. Above the upper bar, 
STSs are illustrated, the bulk of them anchored either at 
low or high resolution on the genetic map (see, for 


and underpins the development of a high-integrity, 
overlying physical map. The genetic map provides 
confirmation on the integrity of growing contigs 
as well as eliminating false contig information. 
Furthermore, the genetic map allows us to orientate 
growing contigs and aids contig closure (Fig. 26.7). 
One route to contig closure across any chromosome 
region is the use of IRS-PCR to develop new markers 
as a tool to extend seed contigs. 
Fig. 26.7 High-resolution genetic maps allied to physical 
mapping for the efficient closure of chromosome 
contigs. A YAC contig covering loci from DXMit22, Hprt, 
DXMit23 and DXMit159 is illustrated. These STSs, along 
with DXMit91, for which YACs have also been isolated, 
are unresolved on the low-resolution MIT map. In the 
EUCIB back-cross, two recombinants have been detected 
between Hprt and DXMit159/91. No recombinants have 
been detected between DXMit159 and DXMit91. This 
high-resolution map information orients the 
DXMit22-DXMit159 contig with respect to DXMit91 
YACs and indicates the most efficient route to close the 
contig in this area by, for example, B1 walking. 


example, Fig. 26.3). Below the upper bar, the YACs 
detected by the anchored STSs are illustrated and a 
number of contigs can be identified. YAC coordinates for 
each anchored STS are given. The lower bar provides a 
schematic of the region of the X chromosome under 
examination. 


26.4.2 Physical maps using interspersed repeat 
sequences 


26.4.2.1 Recovering YACs to IRS-PCR markers 

IRS-PCR markers such as the B1 and B2 repeats are 
being used for the genetic mapping of mouse 
chromosomes (see Section 26.3.1.1). Additionally, 
these markers can be screened against the available 
YAC libraries to provide additional YAC clones for 
generating contigs in any region. Screening of IRS- 
PCR products against YAC libraries must involve 
hybridization techniques and there are two princi- 
pal approaches. First, an IRS-PCR marker can be 
hybridized directly to high-density gridded spots 
of IRS-PCR products from individual YAC clones. 
Alternatively, IRS-PCR markers can be hybridized 
to IRS-PCR products of DNA pools from three- 
dimensional arrays of YAC clones (see above and 
ref. 37). B1 IRS-PCR products are recovered from 
each of the three-dimensional DNA pools and 
robotically spotted onto filters. Following hybridiza- 
tion to the IRS-PCR marker, the positive signal in 
each pool (Fig. 26.8) indicates the correct YAC 
coordinate (floor, column and row) that contains the 
relevant IRS-PCR marker [37,38]. 


26.4.2.2 Extending seed contigs using IRS-PCR 

One efficient way to extend the growing contigs on a 
particular chromosome is to apply IRS-PCR to 
recover new markers to seed contigs [37]. These new 
IRS-PCR markers can be used directly to extend the 
Fig. 26.8 The use of IRS-PCR products to isolate new 
YAC clones: an efficient route for extending YAC contigs. 
Schematic representation of the isolation of new YAC 
clones using B1 IRS-PCR in order to extend a growing 
YAC contig. PCR amplification using a B1 repeat primer 
of three YAC clones (YAC 1-3) from the end of a YAC 
contig generates a number of IRS-PCR products from 
each YAC when analysed by gel electrophoresis (a). Some 
bands appear to be held in common and may arise from a 
non-chimaeric portion of the YAC contig. Isolation of 
common B1 IRS-PCR products (circled) followed by 
hybridization to filters containing spotted arrays of B1 
IRS-PCR products of YAC DNA pools from the 3-D 
library allows the identification of new YAC coordinates 
(positive floor, row and column) and the potential 
identification of new YAC clones that extend the growing 
YAC contig (b). 


nascent contigs. IRS-PCR products can be recovered 
from the YACs towards the end of a contig and 
separated on an agarose gel (see Fig. 26.8). Identifi- 
cation of a common IRS-PCR product from two or 
more YACs at the end of a contig indicates that the 
product is unlikely to have originated from a 
chimaeric, noncontiguous portion of the YAC contig 
[37]. Subsequently, this IRS-PCR product can be 
cloned or used directly as a probe against filters 
containing IRS-PCR products of a YAC library (see 
above). 
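
As an illustration of the band comparison described above, the following sketch picks out IRS-PCR products of similar size that occur in two or more YACs from the end of a contig. The band sizes, the 10 bp tolerance and the function name are all hypothetical. Bands that comigrate on a gel need not have the same sequence, so a shared size is only a first indication that a product is held in common.

```python
def common_bands(bands_by_yac, tolerance_bp=10, min_yacs=2):
    """Return (mean size, YACs) for bands of similar size seen in several YACs.

    Band sizes are sorted and chained into groups wherever neighbouring sizes
    differ by no more than tolerance_bp.
    """
    tagged = sorted(
        (size, yac) for yac, sizes in bands_by_yac.items() for size in sizes
    )
    groups, current = [], []
    for size, yac in tagged:
        if current and size - current[-1][0] > tolerance_bp:
            groups.append(current)
            current = []
        current.append((size, yac))
    if current:
        groups.append(current)

    shared = []
    for group in groups:
        yacs = sorted({yac for _, yac in group})
        if len(yacs) >= min_yacs:
            mean_size = round(sum(size for size, _ in group) / len(group))
            shared.append((mean_size, yacs))
    return shared


# Hypothetical B1 IRS-PCR band sizes (bp) for three YACs at the end of a contig.
bands = {
    "YAC1": [450, 820, 1300],
    "YAC2": [455, 1100],
    "YAC3": [452, 1106, 1600],
}
shared_bands = common_bands(bands)
print(shared_bands)
```

Bands found in only one YAC (820, 1300 and 1600 bp here) are set aside, since they are the ones most likely to derive from a chimaeric portion of a clone.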

It is again important to emphasize that the 
strategy of extending seed contigs by the use of B1 
IRS-PCR is greatly assisted by the availability of a 
high-resolution genetic map that aids contig 
orientation and closure. As indicated in Fig. 26.7, 
contig orientation often indicates the most efficient 
routes for contig closure by B1 walking or by other 
routes. 




26.4.3 Uses of the physical map 


The physical map gives us access to all of the 
underlying sequence in any particular region and, 
most importantly, access to the coding sequences. 
For most of the interesting mutations in both the 
mouse and human genomes, the underlying gene is 
not identified and there may be no suitable candi- 
date gene to test. The only way of accessing the gene 
carrying the mutation is by a strategy known as 
positional cloning [1,39]. As indicated above, genetic 
crosses segregating the mutation can identify the 
most closely linked DNA markers. YAC contigs 
constructed using STSs that closely flank a mutation 
as identified by genetic analysis will contain the 
relevant gene and are an important start-point for 
identification of the relevant locus (see Fig. 26.4). 
YAC clones must be further analysed to identify the 
incumbent gene sequences and isolate potential 
candidates for the mutation. The techniques avail- 
able for gene identification include techniques that 
are commonly used in a variety of organisms 
including: 

1 exon trapping of YAC clones [40,41] (see also 
Chapter 17); 

2 screening of YACs against libraries of cDNA 
clones [42]; 

3 identification of sequences conserved between 
species. 


26.4.3.1 Confirming a candidate gene 
isolated by positional cloning 

Once gene sequences are identified from YACs, it 
is necessary to assess their candidature for the 
mutation that is being positionally cloned. For many 
mutations, there is some understanding of the likely 
site that is affected by the defect and therefore some 
idea of the likely tissue within which the gene is 
normally expressed. Examination of tissue and 
developmental profiles of expression may eliminate 
certain candidates. This may have been taken into 
account when using YACs to screen cDNA libraries. 
The cDNA libraries used will, if possible, have been 
prepared from the tissue that is the likely site of 
expression. Finally, it will be necessary to determine 
by direct DNA sequencing of any candidate locus 
that the gene carries the mutation. For many 
mouse loci, multiple mutations are available [14] 
and in many cases this will aid confirmation that the 
correct gene has been cloned. 


26.5 Conclusions 


The Mouse Genome Mapping Project aims to 
achieve comprehensive genetic and physical maps 
across all chromosomes, leading to further profound 
biological insights into mammalian gene function 
and organization. Conserved linkage groups be- 
tween mouse and human chromosomes enable an 
extensive experimental interplay between these two 
pivotal mammalian genomes. Genes isolated and 
mapped in human can be examined as candidates 
for mouse mutations, while, equivalently, genes 
identified in the mouse can be analysed as can- 
didates for mutations underlying human genetic 
diseases. The identification of mouse models for 
human genetic disease will have an increasing 
impact upon human biology and will be aided by 
the rapid expansion in mouse genomics. By the end 
of 1996, we can expect to see the completion of a 
high-resolution genetic map of the mouse genome, 
and by 1997, completion of comprehensive physical 
maps of a number of mouse chromosomes with the 
whole genome physical map following shortly 
thereafter. At the same time, we can expect the rapid 
development of high density transcript maps of the 
mouse genome that will position on the physical 
maps many more genes than currently mapped. 
Detailed physical and transcript maps can be 
expected to give us increasing and easier access to 
the various genes underlying the myriad mutations 
available in the mouse. 

Finally, comparative sequencing of diverse 
genomes, including those of mouse and human, is 
expected to play an increasing role in our analysis of 
genome structure and function. A complete physical 
map of the mouse will provide the templates for the 
development of sequence-ready maps as a prelude 
to the acquisition of large tracts of sequence from the 
mouse genome. 
27.1 Introduction 


As a direct result of the powerful techniques of 
genome analysis (see Sections 1-4), it has become 
possible to map, clone and sequence individual 
genes, mutations of which are responsible for the 
development of disease. Identification of such 
genes, and the disease-associated mutations, has 
raised the prospect that genetic disease may be 
treatable by direct correction of the underlying 
defect, that is, at the level of the genome itself. 

Gene therapy was initially conceived as a way to 
treat diseases for which a (simple) genetic defect was 
known to be the cause. In its simplest form, gene 
therapy involves the delivery of a functionally 
correct copy of a mutated gene into the affected cells 
in order to obtain long-term correction of the physio- 
logical defect caused by the mutation. An example is 
the treatment of cystic fibrosis by delivery of the 
gene for the cystic fibrosis chloride ion transporter 
(CFTR) protein into the airway epithelial cells of 
patients with cystic fibrosis. 

However, as the number of diseases that are 
known to have at least some genetic component 
increases, the definition of gene therapy has become 
much broader. Now gene therapy is routinely 
invoked to encompass the use of genetic material to 
alleviate the symptoms of a disease, even if the 
therapeutic genes are not strictly ‘corrective’ (in the 
sense of restoring a function known to be mutated in 
the affected cells). Hence, the delivery of cytotoxic 
genes to kill cancer cells (rather than to correct the 
oncogenic mutations within them) is also accepted 
as gene therapy. Therefore, in its broadest terms, 
gene therapy represents ‘an opportunity for the 
treatment of genetic disorders in adults and children 
by genetic modification of human body cells’ [1]. 

A further important classification is to distinguish 
the heritable potential of gene therapy. All of the gene 
therapy trials that are currently approved for use in 
human patients target those somatic cells that will 
live only as long as the patient. Barring inadvertent 
spread of the therapeutic genes to the gametes, the 
genetic treatment will only affect one generation and 
will not be able to alter the genetic make-up of any 
offspring. This is therefore known as somatic gene 
therapy. The purpose of somatic gene therapy is to 
alleviate disease in the treated individual, and that 
individual alone. 

In contrast, it is also possible to target the gametes 
(sperm and ova) directly in order to modify the 
genetic profile, not of the current but of the 
subsequent generation, of unborn ‘patients’. Gene 
transfer at an early stage of embryonic development 
may also have the same effects by achieving gene 


transfer to both somatic and germ line cells. This is 
germline gene therapy. The attraction of germline gene 
therapy for the treatment of disease is that, at least in 
theory, permanent genetic cures might be achieved 
by delivering a functional copy of a mutated gene to 
every cell of the resulting progeny. However, there 
is currently widely held apprehension about the 
development of germline gene therapy research 
programs. The ability to alter the genetic profile of 
subsequent generations rightly invokes many 
spectres. Apart from a complete inability to predict 
the long term sequelae of altering the germline by 
delivery of exogenous genetic material at the 
scientific level, there are many ethical issues raised 
by the prospect of treating ‘patients’ whose consent 
it is impossible to obtain. 

In addition, although it is currently not possible to 
manipulate traits such as ‘intelligence’ or ‘beauty’ 
genetically, there is a perceived fear of such 
technology being abused in eugenic-type breeding 
programs in the future. As a result, the major ethical 
and regulatory bodies of gene therapy in both the 
USA and in Europe have placed a moratorium on 
the consideration of any germline gene therapy of 
human patients because of ‘insufficient knowledge 
to evaluate the risks to future generations’ [1]. 
However, it is important that such issues be 
addressed at a regulatory level sooner rather than 
later. Non-consideration of applications for germ- 
line trials in patients will in no way prevent con- 
tinued research into the direct genetic modification of 
the germline and the relevant ethical and regulatory 
dilemmas will simply be deferred, rather than 
solved, by procrastination. 


27.2 The perfect disease 


Genetically, gene therapy is well advanced for many 
diseases: that is, the underlying genetic defect has 
been identified and the corrective version of the 
relevant mutated gene is available for delivery into 
and expression in target tissues. However, it is the 
imperfections of current in vivo gene delivery 
technologies which currently impose the most 
limiting restrictions upon the practical success of 
most proposed gene therapy protocols [2-5]. There- 
fore, when assessing candidate diseases for gene 
therapy, several considerations must be taken into 
account. The following checklist can be drawn up, 
against which a candidate disease can be compared 
when considering how it compares to the ‘perfect’ 
target disease for gene therapy (Fig. 27.1). 

1 The pathology of the disease should be caused by 
a defect in just a single gene (monogenic disorders), 
the correction of which will restore normal physio- 
Fig. 27.1 An idealized protocol for the gene therapy of a 
simple monogenic disorder. (a) The pathology of the 
disease is caused by a mutation (m) in just a single gene 
in the target cells (large ovals), which are surrounded by 
uninvolved cells (small circles, U) which also carry the 
mutation but are not pathologically affected. The target 
cells should be in localized, anatomically accessible 
positions for direct in vivo gene delivery (for example, 
by direct injection by syringe). (b) Following gene 
transfer, the pathology can be reversed by simple, 
constitutive ON/OFF regulation of expression of the 
correct version of the gene (—) in the affected target 
cells. In addition, correction (c) of the overall 
physiological defect should be achievable by delivery of 
the corrective gene to only a proportion of the affected 
target cells. The biological properties of the gene 
delivery vehicle and expression of the therapeutic gene 
should be nontoxic to normal uninvolved cells so that 
perfectly targeted delivery only to the affected cell type 
is not required. 

Target cell carrying single gene mutation (m) causing physiological defect 

Uninvolved cell type neighbouring target cell 

Delivery of therapeutic gene 
— Normal version of mutated gene (m) 

Target cell into which therapeutic gene has been transferred. 
Expression corrects physiological defect (C) 

Uninvolved cell into which the therapeutic gene has been 
transferred but in which expression causes no toxic effects 

Target cell carrying mutation (m) into which the therapeutic gene 
has not been transferred but in which the physiological defect 
has been corrected (C) via a bystander effect from transduced cells 


logical function to the affected cells and tissues. 
Hence, affected cells require only one gene to be 
delivered; the probability of delivering more than 
one gene to any given cell in vivo diminishes rapidly 
with increasing number. 

2 The gene which is mutated in such a monogenic 
disorder should have been cloned and the mutations 
which cause disease should be well characterized. 

3 Correction of the physiological defect caused by 
the mutation should be achievable by simple, 
constitutive ON/OFF regulation of expression of the 
correct version of the gene. Obtaining temporally 
regulated gene expression in target cells in vivo 
requires inclusion of regulatory elements which 
are still being characterized in most systems; in 


addition, quantitative regulation of exogenously 
introduced gene expression, relative to other endo- 
genous genes in vivo, is also likely to be especially 
problematic. 

4 The biological properties of the gene delivery 
vehicle and expression of the therapeutic gene 
should be nontoxic to normal cells so that perfectly 
targeted delivery only to the affected cell type is not 
required. This will permit relatively promiscuous 
gene delivery without widespread toxicity. 

5 The target cells/tissue for gene correction should 
be in localized, anatomically accessible positions. 
Delivery of a single copy of any gene to every cell in 
the body is currently impossible, other than by 
germline or in utero gene therapy. This requirement 
will help to overcome the problems with efficiency 
of gene delivery, which is a major limitation to gene 
therapy for most diseases. 

6 Indeed, delivery of a single copy of a gene to every 
cell even in a localized body compartment is highly 
improbable with current technologies. Therefore, 
correction of the physiological defect should be 
achievable by delivery of the corrective gene to only 
a proportion of the affected target cells. 

7 Given the costs associated with the development 
of any new drug for human use, and especially 
considering the heightened safety concerns 
associated with the use of genetic treatments in 
humans, there should be no effective existing 
treatment for the disease; only then are the many 
regulatory hurdles which must be traversed for the 
use of gene therapy justified. 
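
The argument behind criterion 1 can be made concrete with a small calculation. If each gene is delivered to a cell independently with the same efficiency p, the fraction of cells receiving all n genes is p to the power n. Independent delivery and the 10% efficiency used below are simplifying assumptions made for illustration, not figures from the text.

```python
def fraction_receiving_all(per_gene_efficiency, n_genes):
    """Fraction of cells expected to receive every one of n_genes, assuming
    each gene is delivered independently with the same efficiency."""
    if not 0.0 <= per_gene_efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return per_gene_efficiency ** n_genes


# Hypothetical per-gene delivery efficiency of 10%.
EFFICIENCY = 0.10
fractions = {n: fraction_receiving_all(EFFICIENCY, n) for n in (1, 2, 3)}

for n, fraction in fractions.items():
    print(f"{n} gene(s): {fraction:.3%} of cells")
```

Under these assumptions each additional gene reduces the fraction of fully corrected cells tenfold, which is why disorders that need only a single gene delivered are the most tractable.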


27.3 The real diseases 


In contrast to the idealized situation described 
above, a wide variety of conditions have been 
proposed to be amenable to gene therapy, some 
more realistically than others [6]. These range from 
simple monogenic disorders (e.g. cystic fibrosis), 
which fulfil many of the criteria for the ideal 
candidate disease, through more complex mono- 
genic and multifactorial genetic diseases (e.g. can- 
cer), to diseases where the underlying genetic 
‘defect’ is introduced into the patient in the form of 
pathogenic genomes of bacteria or viruses (e.g. 
HIV). Examples of the spectrum of diseases cur- 
rently under active investigation with genetic ther- 
apies are given below; although it is not possible to 
describe each disease in great detail, examples are 
used from different classes to illustrate the potential, 
and pitfalls, of gene therapy. 


27.3.1 Simple monogenic disorders 


Not surprisingly, the diseases for which clinical 
trials are most advanced, and for which there is the 
most optimism for the clinical outcome, are those 
which conform most closely to criteria 1-7 above. 
The best examples of such disorders are cystic 


fibrosis and severe combined immune deficiency 
(SCID). 


27.3.1.1 Cystic fibrosis 

Cystic fibrosis is a recessive disorder caused by 
mutation to a single gene encoding a chloride ion 
transporter protein, the CFTR [7]. When a patient 
inherits two mutated copies of the CFTR gene, ion 
transport across epithelial surfaces is disrupted. The 
most life-threatening pathology of CF presents as 


an accumulation of thick mucus in the airways 
accompanied by high risks of bacterial infection. 
This pathology is directly attributable to a defect in 
the chloride ion transport across the airway 
epithelial cells such that water is not secreted into 
the mucus-lined airway passages. However, CF 
patients also have other pathological consequences, 
especially in the gut and pancreas, but these 
conditions are usually managed effectively relative 
to the pulmonary symptoms [7]. 

The CFTR gene was cloned following extensive 
mapping studies and the range of mutations 
associated with the CF phenotype has been well 
documented [7]. In vitro and in vivo studies have 
shown that as few as 30% of cells in sheets of affected 
CF epithelial cells need to express the correct version 
of the CFTR gene for normal physiological levels of 
Cl⁻ ion transport to be restored to the entire cell layer 
[7]. Transgenic CF mouse models have also been 
developed that show physiologically defective Cl⁻ 
ion transport across their airway epithelial cells. 
Unfortunately, these transgenic models do not 
necessarily develop CF-like disease so therapeutic 
gene therapy is difficult to demonstrate, although 
correction of the chloride ion transport defect has 
been conclusively shown [8,9]. 

Therefore, cystic fibrosis represents a near ideal 
candidate for classical gene therapy (Fig. 27.2). The 
pathology is caused by a single gene defect (Section 
27.2, criterion 1) which can be corrected in in vitro 
and in vivo models by expression of the correct 
version of the gene (criterion 2) without the need for 
specific temporal or quantitative regulation of its 
expression (criterion 3). There is no evidence that 
expression of the CFTR gene in other tissues is toxic 
(criterion 4) and expression of the correct gene in 
only a proportion of affected epithelial cells is 
sufficient to restore normal function to epithelial 
cell layers (criterion 6). The target cell population 
for gene correction is relatively accessible to gene 
delivery by aerosols or even direct application 
(criterion 5) and although a range of conventional 
treatments can extend CF patients’ lifespans into the 
mid-thirties, the lack of life-long treatments for CF 
more than justifies the investment in gene therapy as a 
curative alternative (criterion 7). 

Therefore, clinical trials of delivery of the CFTR 
gene to affected airway epithelial cells have now 
been approved and are underway in both the UK 
[10] and the USA [11]. In the first instance, these 
trials are aimed at assessing safety and are unlikely 
to show real therapeutic effects, not least because 
various technical hurdles still remain to be 
overcome. For example, although delivery of the 
CFTR expression vector to the airways is physically 
Fig. 27.2 Gene therapy for cystic fibrosis. The airway 
epithelial cells of a CF patient lack functional CFTR 
protein and cannot pump chloride ions across the cell 
layer. Consequently, water is not cotransported into the 
lumen and a life-threatening barrier of mucus 
accumulates. In vivo delivery of functional CFTR gene 
into at least some of the affected airway epithelial cells 
should generate sufficient Cl⁻ ion transport across the 
cell layer that enough water is now pumped into the 
lumen to clear the mucus barrier from the whole airway. 

[Fig. 27.2 labels: Lack of Cl⁻ transport prevents water efflux and leads to build up of mucus in lumen; Airway epithelial cell layer, lacking functional CFTR protein; Transfer and expression of CFTR gene into a proportion of epithelial cells; Expression of CFTR in only a proportion of epithelial cells restores sufficient Cl⁻ transport and water efflux to correct the CF phenotype.] 


relatively simple (using either DNA complexed with 
cationic liposomes or high titre CFTR-adenoviral 
stocks), these vectors must penetrate the thick 
mucus to gain access to the epithelial cells before 
physiological correction can occur. It remains to be 
seen whether sufficient epithelial cells can be 
targeted in this way to generate clinical benefits to 
the patients. 

In addition, other confounding factors associated 
with gene delivery make it unlikely that these early 
trials would be truly therapeutic. Since adenoviral 
vectors do not integrate into target cell chromo- 
somes [3], any cells which become successfully 
transduced with the gene are most likely to express 
it only transiently. Hence, repeated administrations 
of viral vector would be required for chronic 
correction in the patient. However, development of 
immunity to the virus may well prevent such 
repeated administrations being effective [12]. In 
addition, inflammation in the lungs of animals 
treated with high-titre doses of recombinant adeno- 
virus has been reported and one patient in a trial in 
the USA has already developed a life-threatening 
inflammatory reaction as a result of immune 
reactivity against very high dose adenoviral stock 
administered into the airway passage [13]. Alter- 
native trials using CFTR expression vector plasmid 
DNA complexed with cationic liposomes seek to 
avoid such inflammatory problems by excluding the 
use of viral vectors. However, what such protocols 
seek to gain in terms of repeatability of dosing, they 
lose in terms of efficiency of gene transfer. Ideally, 
the corrective CFTR gene should penetrate the 
mucous barrier at sufficient levels for at least some 
of the stem cells of the continually self-renewing 
epithelial cell layer to become transduced. Only if 
stem cells can be stably transduced will the need for 
life-long administrations be avoided, a dogma which 
holds for many different forms of gene therapy. 

Initial reports on the in vivo correction of Cl⁻ 
transport across small, treated areas (usually of the 
nasal lining of CF patients) are now appearing in the 
literature and look cautiously hopeful [14]. How- 
ever, much technical work remains to be done before 
the inevitable compromises between efficiency and 
safety of gene delivery can be reconciled, and trials 
can proceed to protocols in which genuine clinical 
benefits are expected. 


27.3.1.2 Severe combined immune deficiency 

The second simple monogenic disorder which is at 
the forefront of human gene therapy trials is SCID. 
One form of SCID is caused by the absence of 
functional adenosine deaminase (ADA) in the 
patient’s lymphocytes. Animal models of 
SCID have shown that T-cell function can be 
corrected by removing affected lymphocytes ex vivo 
and expressing the cloned ADA gene in them. 
Return of ‘corrected’ lymphocytes to the animal can 
then provide sufficient enzyme levels systemically 
so that the immune system can function at normal 
levels [15]. Therefore, clinical trials are now well 
advanced in the USA in which a patient’s 
lymphocytes are removed, transduced ex vivo with a 
retrovirus encoding the ADA gene, and returned in 
vivo to act as a source of ADA (Fig. 27.3) [16]. In this 
instance, many of the delivery problems associated 
with the cystic fibrosis trials are overcome by the ex 
vivo isolation of the target cells, their high level of 
transduction with viral vectors, and the potential to 
gain long-term, stable expression of the therapeutic 
gene by using a retroviral vector which integrates 
into the genome [15]. 

654 CHAPTER 27 GENE THERAPY 

Fig. 27.3 Gene therapy for ADA-deficient SCID 
patients. Patients’ lymphocytes are removed, 
transduced ex vivo with a retrovirus encoding the ADA 
gene, and returned in vivo to act as a source of serum 
ADA. In these first trials, it was considered to be 
ethically unacceptable to withhold the existing 
treatment of recombinant PEG-complexed ADA enzyme 
from patients treated with the genetically modified 
lymphocytes. [Figure panels: SCID patient (no serum 
ADA); T cells recovered ex vivo, expanded in vitro and 
infected with ADA retrovirus; reconstituted SCID 
patient, serum ADA increased.] 

... thalassaemia involves both effective gene delivery 
and appropriate regulation of gene expression. 
(a) β-thalassaemia is the result of a deficiency of 
β-globin chains, relative to α-chains, such that the 
resulting haemoglobin tetramers are unstable and 
defective in their normal oxygen carriage properties. 
(b) However, overexpressing the β-globin gene may 
simply deregulate haemoglobin synthesis in an equally 
detrimental way by converting a β-globin thalassaemia 
(relative lack of β-globin chains) into an 
α-thalassaemia (relative lack of α-globin chains). 
(c) Correctly coordinated expression of the introduced 
β-globin gene, relative to the endogenous α-globin 
gene, in target cells could correct the β-thalassaemia 
(deficiency of β-globin chains). [Figure labels: 
mutated β-globin gene; α-thalassaemia, relative lack 
of α-globin chains; thalassaemia corrected.] 

The results of these trials (one of the first human 
gene therapy trials to be approved) are very 
encouraging, although in some respects they remain 
ambiguous. Since it was considered to be ethically 
unacceptable to withhold the existing treatment of 
recombinant polyethylene glycol (PEG)-ADA complex 
from patients treated with the genetically modified 
lymphocytes (who now attend school and are 
apparently well), it has not been possible to attribute 
their continued immune function solely to the gene 
therapy rather than to the conventional treatment. 
None the less, lack of detectable treatment-related 
toxicity, detection of the introduced gene in 
circulating lymphocytes and continually elevated 
levels of ADA suggest that this form of gene therapy 
may eventually become standard in the treatment of 
the disease. 

Although ADA deficiency is one of the flagships 
of human clinical gene therapy, it is actually a very 
rare disorder. Its adoption as a prototype disease 
is certainly not driven by consideration of the 
existence of large numbers of desperate patients 
(criterion 7 above). Rather, its amenability to the 
requirements of gene therapy (criteria 1-6) means 
that it is the most likely to work [15]. Therefore, it is 
hoped that apparent success in this disease can be 
used as a justification to proceed with other similar 
monogenic disorders and even with other less ideal 
situations. Examples of such cases include a variety 
of metabolic disorders where the pathology is 
associated with the lack of a single identified 
enzyme [17]. Often, restoration of 5-25% of normal 
serum enzyme activity will protect from clinical 
disease in conditions such as haemophilia B, caused 
by a lack of the blood clotting factor IX. Therefore, 
the relevant gene can be delivered into ectopic 
tissues, or into fibroblasts ex vivo followed by 
implantation of the genetically modified cells to 
serve as a source of serum enzyme. Haemophiliac 
dogs have been ‘cured’ by implantation of cells 
modified by addition of the factor IX gene, or by 
direct gene modification of hepatocytes with 
retroviral vectors encoding the factor IX gene [18], 
and human trials based on these results have been 
proposed. 

In summary, there are several disorders whose 
properties make them conceptually very attractive 
as candidates for gene therapy, as defined by the 
criteria listed above. However, even the most 
theoretically amenable diseases still present many 
technical difficulties which must be overcome before 
gene therapy becomes a routine tool in patient 
management. 


27.3.2 Complex monogenic disorders 


Treatment of certain other monogenic disorders will, 
however, be more complex from both the pragmatic 
and genetic standpoints. In these instances, simple 
replacement of the corrective gene into either the 
normal cells that produce the relevant gene product 
(e.g. airway epithelial cells in CF) or into more easily 
manipulated ectopic tissues (e.g. transplanted fibro- 
blasts for secretion of factor IX), is not likely to be 
sufficient to alleviate disease symptoms. 

For example, some monogenic metabolic dis- 
orders will require gene modification of specific 
tissues that provide cofactors for enzyme activity. 
Hence, correction of phenylketonuria requires 
delivery of the phenylalanine hydroxylase enzyme 
specifically to liver cells because of the cofactors 
produced in hepatic cells necessary for optimal 
enzyme activity. Similar requirements will be 
necessary for treatment of some disorders of glyco- 
gen metabolism or of the urea cycle where normal 
function of therapeutic genes requires additional 
hepatic enzymes [17]. 

Another example of a complex monogenic disorder 
is provided by the haemoglobinopathies. Thalassaemias 
are the result of a deficiency of globin genes such 
that the resulting haemoglobin structures are 
unstable and/or defective in their normal 
oxygen-carrying properties [19]. As such, it is 
attractive to propose that simple delivery of the 
missing globin genes could be used to correct the 
relevant thalassaemic condition. Thus, expression of 
the β-globin gene in target cells could reverse 
β-thalassaemia (deficiency of β-globin chains). 
Unfortunately, synthesis of the haemoglobin tetramers 
involves very tight biochemical regulation, 
characterized by both temporal and quantitative 
controls on the production of several different globin 
species relative to each other. Therefore, simply 
overexpressing a particular globin molecule in the cell 
at any given time may merely deregulate haemoglobin 
synthesis in a different way, for instance by 
converting a β-globin thalassaemia (relative lack of 
β-globin chains) into an α-thalassaemia (relative lack 
of α-globin chains). 
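
The balance argument above can be made concrete with a small numerical sketch. It is illustrative only: the function name, the arbitrary expression units and the 20% tolerance are assumptions introduced here, not values taken from the text or from clinical practice. The function simply compares β-chain output with α-chain output and reports which chain, if either, is in relative deficit.

```python
def classify_chain_balance(alpha, beta, tolerance=0.2):
    """Classify globin chain balance from relative synthesis levels.

    alpha, beta: chain output in arbitrary units (hypothetical values).
    tolerance:   assumed fractional imbalance still treated as balanced.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("chain outputs must be positive")
    ratio = beta / alpha
    if ratio < 1 - tolerance:
        return "relative lack of beta-globin chains"
    if ratio > 1 + tolerance:
        return "relative lack of alpha-globin chains"
    return "balanced"


# Untreated cell: endogenous beta output is low relative to alpha.
untreated = classify_chain_balance(alpha=1.0, beta=0.3)
# Unregulated transgene: beta output now far exceeds alpha.
overexpressed = classify_chain_balance(alpha=1.0, beta=0.3 + 2.0)
# Coordinated transgene: beta output restored to roughly match alpha.
coordinated = classify_chain_balance(alpha=1.0, beta=0.3 + 0.7)
```

In this toy model the same transgene moves the cell from one imbalance to the opposite one unless its output is matched to the endogenous α-chain level, which is the point made in the paragraph above.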

Transcription of the globin gene family is 
known to be tightly regulated by tissue-specific 
enhancers and locus control regions (LCRs) [20,21], 
which determine the temporal switching of globin 
chain synthesis during development. Effective gene 
therapy aimed at control of globin synthesis will 
therefore have to incorporate such transcriptional 
regulation into the therapeutic constructs. Although 
retroviral vectors have been constructed that do 
appear to preserve the developmental regulation 
pattern of globin expression [22], there remains 
much to improve before gene therapy of thalas- 
saemias can be confidently advanced into a clinical 
setting. 

In summary, several diseases are caused by 
defects in just a single cloned gene (criteria 1 and 2); 
however, in many cases the therapeutic issue 
focuses not upon the gene itself but on achieving the 
correct levels and timing of its expression (criterion 
3) relative to other proteins with which the gene 
product must interact in the relevant biochemical 
pathways in vivo. Identification of transcriptional 
control elements which can target and regulate gene 
expression promises to be one of the most important 
advances in the coming years in the field of gene 
therapy. 


27.3.3 Multifactorial genetic disorders 


Many diseases are now known that clearly have a 
genetic component but in which the genetic 
contribution is shared between several genetic loci 
and/or is also enhanced by epigenetic factors (see 
Chapter 2). For example, genetic linkages have been 
variously reported for several psychiatric disorders 
but the degree of genetic and environmental contri- 
butions remains unclear. Even when candidate 
genes for such diseases have been identified, as is 
potentially the case for Alzheimer’s disease, the 
value of the genes for therapy remains unclear 
because of doubts as to the contributions of other 
genes and environmental influences [23]. 

An example of a disease with multiple 
genetic components that is widely cited as a target 
for gene therapy is cancer. However, if cystic fibrosis 
and ADA deficiency represent the conceptually easy 
end of the gene therapy spectrum, then cancer 
represents the other extreme [5]. It fulfils hardly any 
of the criteria set out earlier. The evolution of the 
malignant phenotype usually involves multiple 
genetic lesions within the same cell (see below) 
(criterion 1) and it is unlikely that the nature of every 
one of these oncogenic mutations is yet known 
(criterion 2); most cancer patients die because their 
primary cancers spread throughout the body to 
colonize essential tissues and organs as metastases. 
Hence, the target population for gene therapy is 
usually widely dispersed and often not very 
accessible (criterion 5); in addition, unlike the 
situation in CF or ADA deficiency, every tumour cell 
must, in theory, be ‘corrected’ to avoid the emer- 
gence of recurrent disease. Hence, every malignant 
tumour cell must be targeted by the therapy 
(criterion 6). Therefore, it would seem that cancer 
would not be a natural candidate for gene therapy, 
since the regulated delivery of just a single gene to 
localized areas of affected tissues remains highly 
problematic. None the less, the majority of human 
gene therapy trials currently under clinical 
assessment are targeted towards cancer. The rationale 
for this almost certainly lies not in a 
common belief that cancer is particularly amenable 
to gene therapy, but rather in the fact that there is a 
large patient population lacking effective, tolerable 
treatments (criterion 7). 

The conversion of a normal cell into a fully 
transformed malignant cell typically involves muta- 
tions in several genes of different classes (Fig. 27.5) 
[24]. Thus, so-called dominantly acting mutations 
convert proto-oncogenes into oncogenes and, 
within the same malignant clone, loss of function 
mutations abrogate the activity of tumour suppres- 
sor genes [25]. The genetic pathway of colorectal 
tumorigenesis is commonly believed to involve 
typically about five genetic mutations (or ‘hits’) to 
both proto-oncogenes (such as RAS) and tumour 
suppressor genes (such as p53, DCC, APC) [26] (see 
Appendix VII for details of genes). Therefore, it is far 
from obvious which of these genetic defects should 
be targeted for correction in a ‘classical’ gene 
therapy approach. It also seems improbable that, 
even if a single mutation could be corrected in every 
tumour cell, the malignant phenotype would 
necessarily be reversed, since the evolution of 
malignancy in human tumours is so multicom- 
ponent in nature. 

Fig. 27.5 The multifactorial basis of cancer. The 
evolution of the malignant phenotype of the cancer 
cell (in this case in the colon) involves multiple 
genetic mutations of different types, in both 
proto-oncogenes and tumour suppressor genes [24,25]. 
This makes it difficult to predict which, if any, of the 
many possible genetic targets, if corrected by gene 
transfer, would effectively reverse the malignant 
phenotype. 

None the less, several protocols have been 
proposed in which a mutation that is supposedly 
central to the continued maintenance of the 
transformed phenotype is targeted within the tumour 
cells, in the hope that its correction may reverse the 
malignant phenotype or induce apoptosis (Fig. 
27.6a). Thus, delivery of antisense constructs 
[27] targeted at abrogating the activity of activated 
oncogenes (such as RAS) has been proposed [28], 
as have protocols that seek to deliver a functional 
copy of tumour suppressor genes that are believed 
to be particularly important to maintenance of the 
malignant phenotype [29], such as p53 [30]. However, 
even if these genes are central enough to the 
tumourigenic process in some human cancers to 
be used as rational gene targets, the considerable 
problem remains of delivering at least a single copy 
of the therapeutic gene to every tumour cell carrying 
the mutation (although a bystander effect, of 
unknown origin, has been described which apparently 
leads to the killing of non-transduced tumour 
cells by an antisense construct to the N-RAS gene 
[28]). Such levels of gene delivery, even to all the 
cells in only a localized tumour mass, let alone 
to systemically dispersed metastatic deposits, are 
currently impossible [5], so such strategies remain 
more hopeful than realistic. 

As a result of these considerations, genetic 
therapies for cancer have been proposed which 
necessarily have led to the creation of a broader 
definition of gene therapy. These strategies [31-35] 
represent a fundamental departure from the gene 
therapies already described, wherein the aim has 
been to preserve affected cells by correcting their 
basic genetic defects. Instead, the majority of gene 
therapy protocols for cancer seek to use noncor- 
rective genes to enhance target (tumour) cell killing. 

Gene therapy can be used to kill tumour cells 
either: 

1 directly, by delivery of a cytotoxic gene to the 
tumour cells themselves; or 

2 indirectly, by the delivery of an immunomo- 
dulatory gene which activates the immune system to 
recognize putative tumour antigens and leads to 
immune-mediated cell killing. 

The delivery of cytotoxic genes to tumour cells has 
been used essentially for the treatment of localized 
tumour deposits, which are accessible for gene 
delivery but are inoperable (Fig.27.6b). The most 
commonly used strategy involves delivering a gene 
encoding an enzyme that will activate a pro-drug to 
a toxic metabolite, leading to the death of the cell 
expressing the gene. An example of such a system 
currently in clinical trials is the herpes simplex virus 
thymidine kinase gene (HSVtk) coupled with the 
antiherpetic drug ganciclovir [36]. This system has 
the added advantage that a local bystander killing 
effect leads to the killing of (non-transduced) cells 
neighbouring the cells expressing the HSVtk gene 
due to transfer of toxic metabolites between 
juxtaposed cells [37,38]. 

Fig. 27.6 Three possible approaches to the gene 
therapy of cancer [5]. (a) Corrective gene therapy 
requires the delivery of a single corrective gene (e.g. 
a tumour suppressor gene or antisense to an activated 
oncogene) to the tumour cell. Optimally, expression of 
the corrective gene reverses the malignant phenotype 
of the cell in which it is expressed. However, it is 
unlikely that this will have any effect on the 
continued growth of surrounding tumour cells (one hit, 
one kill). (b) Cytotoxic gene therapy leads to the death 
of the cell expressing the gene as well as its near 
neighbours by a local bystander killing effect (a single 
hit is amplified several-fold, see text for details). 
(c) Immunotherapy involves the expression of an 
immunomodulatory gene in a tumour cell. In theory, this 
‘reveals’ putative tumour antigens, thereby recruiting 
immune effector cells to the deposit to kill similar 
antigen-expressing cells. The activated immune cells can 
also travel to distant sites of occult metastases to kill 
other tumour cells, affording systemic protection against 
the cancer. 

In a trial currently in progress at the National 
Institutes of Health in the USA, patients with inoper- 
able gliomas receive retroviral vectors encoding 
HSVtk by stereotactic injection directly into the 
glioma followed by systemic ganciclovir [39]. In this 
design of trial, the chances of success have been 
maximized by reducing the clinical situation to as 
close to a classical gene therapy approach as possible. 
Hence, a single gene (criterion 1) is delivered to a 
localized target tissue (criterion 5) in a manner 
requiring simple ON/OFF regulation of expression 
(criterion 3); the presence of the metabolic bystander 
effect means that the gene does not have to be 
delivered to every one of the target cell population 
(criterion 6). The problem of toxicity following 
inadvertent delivery of the toxic gene to sur- 
rounding normal brain tissue (criterion 4) has been 
partially overcome in this instance by the use of 
retroviral vectors for gene delivery; these viruses 
can only infect dividing (tumour) cells but cannot 
infect the neighbouring quiescent neural tissue [40]. 
Similar trials using cytotoxic genes delivered to 
tumour masses in other anatomical locations will 
require other forms of targeting to ensure minimal 
toxicity to surrounding tissues. Although still in its 
infancy, the technology for this targeting will 
come from engineering specific tropisms into 
the delivery vectors, both at the level of the vector 
surface and through the use of transcriptional 
targeting [41,42]. 
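
Why a bystander effect relaxes criterion 6 can be illustrated with a back-of-envelope calculation. The model below is an assumption made purely for illustration: each cell is transduced independently with probability `p`, and a cell is killed if it, or any one of `k` neighbouring cells, is transduced. Neither the independence assumption nor any particular value of `k` comes from the trial described above.

```python
def fraction_killed(p, k):
    """Expected fraction of tumour cells killed under the toy model.

    p: assumed probability that any given cell is transduced (0..1).
    k: assumed number of neighbouring cells whose toxic metabolites
       can reach a given cell (k = 0 means no bystander effect,
       i.e. 'one hit, one kill').
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # A cell escapes only if it and all k neighbours are untransduced.
    return 1.0 - (1.0 - p) ** (k + 1)


no_bystander = fraction_killed(p=0.1, k=0)    # equals p
with_bystander = fraction_killed(p=0.1, k=9)  # 1 - 0.9**10
```

With 10% transduction this toy model gives 10% killing without a bystander effect and roughly 65% when nine neighbours contribute; the specific numbers matter less than the shape of the dependence on `k`.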

An alternative approach to cancer gene therapy is 
to deliver genes that enhance the immunogenicity of 
tumour cells, thereby augmenting the immune 
response against them (Fig. 27.6c) [33,43,44]. Use of 
the immune system presents three major theoretical 
advantages for cancer gene therapy. 

1 If it can be activated to recognize tumour-specific 
antigens on tumour cells, the specificity of the 
immune response should mean that systemic toxi- 
city is reduced to a minimum, since only tumour 
cells expressing the antigens will be killed. 

2 Once activated, immune responses have a natural 
response amplification mechanism so that only a 
small stimulus (low levels of gene transfer) is 
required to produce a large response. That response 
should, in theory, be body wide and protect against 
recurrence of disease. 

3 Recruitment of the body’s own immunity to 
recognize and destroy the tumour cells should be far 
less toxic than current treatments such as chemo- 
therapy and radiotherapy. 

In effect, if an immune response can be effectively 
activated against tumour cells, the burden of gene 
delivery efficiency, specificity and inadvertent toxi- 
city should be transferred from the gene therapist 
onto the immune system. 

There is now good evidence that at least some 
tumours express tumour antigens which can be 
recognized under certain circumstances by the im- 
mune system [45]. Therefore, it has been proposed 
that expression of various types of immunostimu- 
latory molecules in tumour cells might enhance 
immune recognition, possibly by overcoming intrin- 
sic defects in the pathways of antigen presentation 
by tumour cells [46]. The hope is that tumour cells 
engineered to express such molecules, either ex vivo 
as vaccines or directly by in vivo gene delivery, will 
generate long lasting immunity to unmodified 
tumour cells growing at distant sites in the body. 
Results from animal models have been encouraging, 
using tumour cells modified to express cytokines 
(e.g. interleukin-2 and -4 (IL-2, IL-4), 
granulocyte-macrophage colony-stimulating factor (GM-CSF), 
and interferons (IFNs)) [43,47], costimulatory mole- 
cules (e.g. members of the B7 family) [48,49], MHC 
molecules [50], allogeneic antigens [51] and syn- 
geneic tumour antigens [52]. Human clinical trials 
are under way to see if these results translate into 
clinical gains in humans [43,47]. 

A modification of this approach has been to use 
immune cells recovered from excised tumours in 
adoptive transfer protocols [53]. Hence, immune 
cells infiltrating certain human tumours, principally 
melanoma, renal cell cancers and colorectal cancers, 
have been grown ex vivo to high numbers and rein- 
fused into patients. These immune cells presumably 
have natural tumour recognition capabilities since 
they are originally isolated from growing tumours; 
when reinfused they should circulate through the 
body and concentrate in metastatic deposits 
expressing whatever antigens they are primed to 
recognize (Fig. 27.7). 

Initial patient trials using nonT/nonB cell 
tumour-infiltrating lymphokine-activated killer 
(LAK) cells in adoptive immunotherapy [54] were 
superseded by the use of a more specific T-cell 
population of IL-2 expanded tumour-infiltrating 
lymphocytes (TILs) [55,56]. Although these trials 
have reported only limited clinical success, TIL 
populations are now being used in gene therapy 
experiments. TILs recovered from patients will be 
engineered ex vivo to express either IL-2 or tumour 
necrosis factor (TNF) and will then be re-infused 
into the patient [33]. The TILs are effectively being 
used as tumour-specific delivery vehicles to express 
immune activating and/or tumoricidal cytokines at 
high concentrations within tumour deposits. It 
is not possible to reach therapeutically useful 
concentrations of such cytokines, especially TNF, by 
systemic administration of recombinant proteins 
because of the toxic effects associated with such 
treatments in humans. However, several technical 
difficulties have been encountered in achieving high 
levels of cytokine expression in patients’ TILs. This 
combination of TIL and gene transfer is attractive if 
the TILs genuinely can localize to tumour deposits 
which the clinician cannot find or treat, but the in vivo 
efficacy of TILs in most tumour types remains 
controversial. 

Fig. 27.7 Adoptive immunotherapy with tumour- 
infiltrating immune cells. The immune cells infiltrating a 
tumour can be recovered from the excised tumour (a), 
grown ex vivo to high numbers (b) and reinfused into 
patients (c). These immune cells presumably have natural 
tumour recognition capabilities since they are originally 
isolated from growing tumours; when re-infused they 
should circulate through the body (d) and concentrate in 
metastatic deposits expressing whatever antigens they 
are primed to recognize (e). [Figure labels: primary 
tumour; tumour resection; in vitro culture of tumour 
cells and tumour-infiltrating cells; in vitro expansion 
of infiltrating cells; secondary tumour.] 

Finally, gene therapy has been proposed as a 
means of improving the efficacy of conventional 
chemotherapeutic treatments. One of the most common 
causes of treatment failure is the emergence 
of drug-resistant tumour cells [57,58] which no 
longer respond to levels of chemotherapy that are 
acceptable to the patient. If chemotherapy doses 
could be increased, without the associated bone 
marrow toxicity, it may be that chemotherapy could 
be more effective against these resistant clones. 
Therefore, it has been proposed that transfer of the 
gene encoding the multidrug resistance protein 
(MDR-1) [57] into bone marrow cells may allow 
increased dosing with chemotherapeutic drugs [59]. 
Drug levels might be attainable which will now be 
toxic to tumour cells but will still be acceptable to the 
modified marrow because MDR protein actively 
pumps various chemotherapeutic drugs out of cells 
which express it. Chemoprotective gene therapy of 
bone marrow cells has been effective in animal 
models [60] and may prove clinically beneficial in 
dose escalation regimens in human patients. 

In summary, gene therapy for diseases, such as 
cancer, that have a multifactorial genetic com- 
ponent, presents many more theoretical problems 
than for the simple monogenic disorders such as CF 
or ADA deficiency. For cancer, in particular, the 
scope of gene therapy has been expanded to include 
the use of cytotoxic and immunomodulatory genes, 
as well as the more conventional corrective 
approaches which are more analogous to CF or ADA 
deficiency. However, reduction of the clinical target 
to as close to the CF-type situation as possible may 
increase the chances of success for specific clinical 
situations (such as the treatment of gliomas with the 
HSVtk/ganciclovir system). 


27.3.4 Infectious diseases 


In theory, gene therapy for infectious diseases is 
attractive because the invading organism introduces 
pathogen-specific genetic material which is an ideal 
target for genetic intervention. For example, anti- 
sense oligonucleotides can be synthesized with 
high specificity for gene targets upon which repli- 
cation of the pathogen is dependent, but which 
should not recognize any cellular genetic material 
[61]. Host target cells could then be transduced with 
such pathogen-protective constructs such that they 
become resistant to productive infection. Such 
approaches have been suggested to treat protozoan 
parasite infections for which drug therapy is 
currently inadequate [61]. 

Viral infections offer similar opportunities for 
specific genetic interventions. Indeed, in cancers 
with a known viral aetiology, the presence of viral 
genes, upon which the evolution of the malignant 
phenotype depends, offers more cause for optimism 
than in the treatment of nonviral cancers, because of 
the presence of specific targets which are separate 
from cellular genes. Therefore, gene therapy 
designed to abrogate the expression of papillomavirus 
transforming proteins E6 and E7 might be effective 
in treatment of cervical cancer; similarly, hepatitis B 
(hepatocellular carcinoma), human T-cell lympho- 
tropic virus types 1 and 2 (adult T-cell lymphoma/ 
leukaemia) and Epstein-Barr virus (nasopharyngeal 
carcinoma and Burkitt’s lymphoma) all offer virus- 
specific targets for gene therapy intervention in the 
infected target cells [62]. 

Similarly, gene therapy is becoming an increas- 
ingly attractive option for the treatment of AIDS in 
the continuing absence of an effective vaccine or 
drug treatment for the human immunodeficiency 
virus (HIV) [63]. HIV is a complex retrovirus whose 
genome expression is controlled by a series of 
regulatory proteins which control levels of viral 
protein production and the switch from latency to 
productive infection [64]. One of these proteins, TAT, 
is an obligatory transcriptional activator of the viral 
promoter in the long terminal repeat (LTR). It may 
be possible to use the complexity of the control of 
genome expression against the virus to protect the 
principal target of HIV infection, the CD4⁺ T cells. 
For instance, T cells removed ex vivo can be 
transduced with constructs that use the HIV LTR to 
direct expression of a suicide gene such as the HSVtk 
gene (see earlier) (Fig. 27.8) [65]. When these T cells 
are returned in vivo the absence of TAT will prevent 
expression of the tk gene. However, if the modified T 
cells become infected with HIV, the wild type virus 
will provide TAT in trans and expression of the 
transgene HSVtk will be activated. Treatment of the 
patient with ganciclovir would kill the infected T 
cells before they could serve as a reservoir of viral 
production, thereby limiting the ability of HIV to 
infect more cells. 

Fig. 27.8 A possible approach to T-cell protective gene 
therapy for infection with the human immunodeficiency 
virus (HIV). (a) In an HIV-infected T cell, the integrated 
proviral LTR promoter has only basal activity, which is 
insufficient to drive expression of viral structural 
proteins (1). However, this basal transcription is 
sufficient to lead to small levels of expression of TAT 
mRNA (2) and protein (3). TAT feeds back on the viral 
LTR to activate transcription, and expression of the viral 
structural proteins is greatly amplified (4) so that the 
infected cell becomes a source of virus production. 
(b) Uninfected T cells can be removed from the body and 
transduced with a TAT-dependent vector in which 
expression of the HSVtk suicide gene is dependent upon 
the HIV LTR and, hence, on the presence of TAT in the 
cell. These modified T cells are returned in vivo. (c) If 
such a modified T cell then becomes infected with 
wild-type HIV, production of TAT by the infecting HIV 
upregulates expression from the TAT-dependent 
LTR-HSVtk construct and the T cell becomes sensitive to 
ganciclovir. Therefore, infected T cells can be killed in vivo 
before they produce more infectious HIV. 
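
The logic of this TAT-dependent suicide strategy can be summarized as a small truth table. The following Python sketch is purely illustrative: the function name and its three boolean inputs are hypothetical, and the model deliberately ignores the basal activity of the LTR, treating the LTR-HSVtk construct as completely silent in the absence of TAT.

```python
from itertools import product


def t_cell_killed(transduced: bool, hiv_infected: bool, ganciclovir: bool) -> bool:
    """Return True if the T cell is expected to be killed in this toy model."""
    # Assumption: TAT is present only when wild-type HIV has infected the cell.
    tat_present = hiv_infected
    # The LTR-HSVtk construct is treated as fully TAT-dependent; basal LTR
    # activity is ignored in this simplification.
    hsvtk_expressed = transduced and tat_present
    # HSVtk only sensitizes the cell; killing also requires ganciclovir.
    return hsvtk_expressed and ganciclovir


if __name__ == "__main__":
    # Enumerate all eight combinations of the three conditions.
    for transduced, infected, drug in product([False, True], repeat=3):
        outcome = t_cell_killed(transduced, infected, drug)
        print(f"transduced={transduced} infected={infected} drug={drug} -> killed={outcome}")
```

Under these assumptions only one of the eight combinations leads to cell death: a transduced cell that has become infected, in a patient receiving ganciclovir. Uninfected modified cells are spared, which is the point of the design.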

However, such approaches would be unlikely 
to abolish infection and would, at best, only slow 
the progression of disease. Other gene therapy 
approaches have also been proposed which seek to 
interfere specifically with viral replication steps 
without killing the infected T cells [66]; these include 
the transduction of CD4+ T cells with TAT-dependent 
HIV-specific antisense or ribozyme [67] 
constructs [68]. So far, in vitro experiments have 
shown promising results in that these constructs can 
protect tissue culture cells from infection with HIV. 
Applications are currently being approved for trials 
in HIV-infected patients. 


27.4 Delivery systems for 
gene therapy 


From the preceding discussions of the clinical 
situations in which gene therapy may have a role for 
the future, it is clear that the principal constraint is 
the ability to deliver the therapeutic gene effectively 
to the target cells. Vector systems must achieve gene 
transfer, depending upon the different clinical 
targets, with varying degrees of efficiency, accuracy, 
stability and safety. These properties will be 
discussed in general terms below; for detailed 
accounts of the properties of individual gene delivery 
systems the reader is referred to other reviews [3-5]. 


27.4.1 Efficiency of gene transfer 


Physical, non-viral methods of gene transfer have 
been described for the transduction of cells both in 
vitro and in vivo. 

Generally, the most efficient means of delivering 
genes to cells in vivo has been by complexing the 
DNA with cationic lipid and either injecting the 
complex into the target tissue directly (e.g. a 
tumour) [51], intravenously [69] or by direct appli- 
cation onto the target tissue [10]. However, these 
methods are usually much less efficient than virus- 
mediated vectors. Viruses are natural genetic 
vectors and have optimized their life cycles for the 
carriage of genes into target cells. The use of 
replication-defective, recombinant viral vectors has 
greatly increased the possible efficiencies of gene 
transfer in vivo. To date, only recombinant retroviral 
vectors and adenoviral vectors have been used in 
clinical trials [2]. Each has specific advantages and 
disadvantages which are reviewed elsewhere [3,4,70]. 
With current vectors the order of decreasing 
efficiency of titres is: adenoviral vectors > retroviral 
vectors > plasmid vectors. 

In order to improve existing efficiencies, novel 
liposome formulations are being developed for 
plasmid-based delivery [69,71] and improvements 
to viral titres have been achieved by various means 
[72]. None the less, currently available vectors often 
lack sufficient titres for the demands of the clinical 
situation and improvements in this area will be 
necessary especially where the target cell population 
is very large (such as tumours). These consider- 
ations have led to suggestions that the only way to 
achieve sufficient titres for certain disorders is to 
develop replication-competent vectors which can 
initiate spreading infections within the target cell 
population [73] but which have inbuilt safety 
features to prevent their spread to other cell types 
[70]. Currently, however, the use of such replicating 
vectors remains strictly a development for the 
future. 


27.4.2 Accuracy of gene transfer 


Ideally, the therapeutic gene should be delivered to, 
and expressed in, only the target cells to prevent any 
treatment-related toxicities, although the impor- 
tance of this requirement depends heavily upon the 
type of gene being used [5]. 

Accuracy of delivery of the vector can be achieved 
at several levels [41,74]. The vector can be delivered 
to the target area by physical means such as 
stereotactic injection into tumour deposits (HSVtk) 
[39] or topical application onto airway epithelial 
cells (CF) [75]. However, more sophisticated genetic 
means of gene targeting are required for vectors 
which encode potentially toxic genes and/or which 
are delivered systemically. 

Vector-specific targeting has been used to target 
HSVtk encoding retroviral vectors to replicating 
glioma cells whilst avoiding infection of quiescent 
neural tissue around the tumour [40,76]. In addition, 
cytokine genes such as IL-2 and TNF might be 
targeted to tumour deposits using the intrinsic 
tumour-homing properties of TILs [33]. Surface 
targeting of the delivery vehicle would be desirable, 
such that it only infects the appropriate cells. 
Incorporation of antibodies or ligands into lipo- 
somes [77] can target physical delivery of drugs and 
plasmids and engineering of (retro)viral envelopes 
may eventually allow cell-specific infection to occur 
via recognition of target cell-specific molecules 
(such as tumour antigens) [78-80]. To date, the most 
effective targeting has been achieved at the transcriptional 
level by inclusion of cell-type specific 
enhancer/locus control regions into both plasmid 
and retroviral vectors [81-83], thereby restricting 
gene expression to target cell types even if delivery 
occurs to surrounding cells. Ultimately, the hope is 
that delivery vehicles will be developed which 
incorporate targeting at several levels including 
transcriptional and surface specificity [74]. 


27.4.3 Stability of gene transfer 


To avoid the need for repeated administration of 
gene therapy, stable integration of a corrective gene 
into at least some of the self-renewing stem cells of 
the target cell population would be the ideal result of 
a single treatment dose. For diseases such as CF or 
ADA deficiency, stability of expression is clearly of 
great importance. Plasmid DNA and retroviral 
vectors can integrate into host cell chromosomes 
(essentially at random sites), although retrovirus- 
mediated integration is much more efficient and 
precise. Adenoviral vectors, however, are main- 
tained episomally in infected cells and are diluted 
out of the target cell population when the cells 
divide. Therefore, the order of decreasing efficiency 
of generating stable gene expression is: retroviral 
vectors > plasmids > adenoviral vectors. 


27.4.4 Safety of gene transfer 


A major concern about the advent of genetic ther- 
apies for patient treatment is the uncertainty of the 
consequences of introducing new genetic material 
into patients’ cells. The use of plasmid DNA alone is 
perceived as carrying less threat than the use of viral 
vectors, partly because less DNA is usually trans- 
ferred and partly because viral vectors usually retain 
viral regulatory sequences to improve efficiency of 
gene transfer. In the case of retroviral vectors, these 
regulatory sequences may cause activation of 
nearby cellular proto-oncogenes following viral 
integration, leading to transformation of the target 
cell [84-86], although the estimated risk of this 
occurring is low [87]. 

In addition, although in vitro tests for replication 
competent viruses are well developed, especially for 
retroviral vector stocks, there is a finite chance that 
contaminating, potentially pathogenic replicating 
viruses might be cotransferred into patients along 
with the recombinant stocks [88]. However, the 
amount of such replication-competent retrovirus 
that must be transferred to a patient to cause disease 
appears to be much greater than the quantities that 
can routinely be detected by current in vitro safety 
tests [89,90]. There is also a risk that naturally 
occurring, superinfecting viruses may rescue novel, 
pathogenic viruses by recombination between the 
wild-type virus and the vector genome. The risks 
associated with the generation of such new 
‘doomsday’ viruses are difficult to quantify but 
probably represent more of a conceptual, than a real, 
risk. 

Recombinant viral stocks are also naturally im- 
munogenic by displaying viral antigens on their 
surfaces [12]. This may hinder the repeated use of 
such stocks if more than one treatment is required as 
immunity to the antigens would be expected after a 
single dose. Moreover, immune responses to even a 
single dose might be damaging to the patient, as 
seen in a potentially life-threatening inflammatory 
reaction of a CF patient treated with very high titre 
adenoviral stock (see earlier). Therefore, a ranking of 
currently used vectors for safety, in decreasing order, 
would be: plasmid vectors > retroviral vectors > 
adenoviral vectors. 


27.4.5 Perspectives 


Of those vectors that have currently been approved 
for use in human trials, no single vector system is 
likely to possess all the desired attributes for any 
given situation. The ranking of vectors for safety, 
efficiency and stability of gene expression does not 
give concordant results. Therefore, there is often 
likely to be conflict in the choice of the optimal 
vector system to use for any particular trial. For 
instance, where high efficiency of gene transfer 
should ideally be combined with long-term stable 
gene expression (such as in the treatment of CF), a 
compromise must be made between the high-titre 
adenoviral vectors, the stable integration of retro- 
viral vectors and the safest option of plasmid 
transfer. In other situations, the dilemma as to which 
system to use may be less acute; for instance, tran- 
sient expression of the HSVtk gene, or an immuno- 
modulatory gene, in tumour cells would probably 
be sufficient for cytotoxic or immunological gene 
therapy of cancer, and in these cases the efficiency of 
adenovirus-mediated gene transfer may prove to be 
optimal. 
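
The lack of concordance between the three rankings given in Sections 27.4.1, 27.4.3 and 27.4.4 can be made explicit with a short Python sketch. The orderings are taken directly from the text; the dictionary and function names are hypothetical and serve only to tabulate the comparison.

```python
# Orderings (best first) as stated in Sections 27.4.1, 27.4.3 and 27.4.4.
RANKINGS = {
    "efficiency": ["adenoviral", "retroviral", "plasmid"],
    "stability": ["retroviral", "plasmid", "adenoviral"],
    "safety": ["plasmid", "retroviral", "adenoviral"],
}


def positions(vector: str) -> dict:
    """Return the 1-based rank of a vector under each criterion."""
    return {criterion: order.index(vector) + 1 for criterion, order in RANKINGS.items()}


def top_choices() -> dict:
    """Return the highest-ranked vector for each criterion."""
    return {criterion: order[0] for criterion, order in RANKINGS.items()}


def rankings_concordant() -> bool:
    """True only if every criterion orders the vectors identically."""
    return len({tuple(order) for order in RANKINGS.values()}) == 1


if __name__ == "__main__":
    for vector in RANKINGS["efficiency"]:
        print(vector, positions(vector))
    print("top choice per criterion:", top_choices())
    print("concordant:", rankings_concordant())
```

Each criterion favours a different vector, so no single system can be chosen without a trade-off; this is the conflict described above for a disease such as CF.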

Other viral vectors, not yet approved for clinical 
use, are currently in development, including herpes 
simplex virus, parvovirus and adeno-associated 
viruses [4]. As the number of vector systems that are 
well characterized enough to be used safely in 
patients increases, so the conflicts between the 
different requirements of each system should be 
easier to resolve. It may also soon be possible to 
synthesize custom-designed delivery vehicles by 
incorporating the best features of different vectors 
into hybrid constructs which have the specific, 
combined properties required for the gene therapy 
protocol of choice [74]. 


27.5 Prospects 


When contemplating the ‘perfect’ disease for which 
intervention by gene therapy stands the greatest 
chance of success, several criteria can be proposed. 
This disease should be a simple monogenic disorder 
for which the gene has been cloned, and should 
require simple high-level gene expression for 
correction; the corrective gene should not be toxic to 
other cells if inadvertently expressed in them, but 
the affected cells should be accessible to gene 
delivery in a localized group, and correction of the 
target cells should be achievable even if only a 
fraction of the cells actually receive the gene. Finally, 
it would only be worthwhile developing gene 
therapy if the disease has no simple, safe and cheap 
treatment already. 

In reality, gene therapy, in some form or other, has 
been proposed for a range of different diseases, and 
this list will continue to expand rapidly, although 
many do not conform at all closely to the above 
checklist. ADA deficiency, the real disease that comes 
closest to this ideal and for which gene therapy has 
apparently been most successful, actually has fewer 
patients who suffer from it than researchers 
working on it. In contrast, cancer, the disease 
which least fits these criteria, is the one for which 
the majority of human trials currently exists, prin- 
cipally because many cancers have such poor pro- 
gnoses that any novel therapeutic approach can 
be justified on the grounds of patient desperation. 

Gene therapy for cancer and infectious diseases, 
such as HIV, has also led to a general broadening 
of the definition of gene therapy away from the 
original concept of the use of genes to correct genetic 
defects within target cells. However, the unrealistic 
expansion of the remit of gene therapy for treatment 
of disease also poses some serious threats to the 
credibility of the field for the future. Inflated claims 
regarding its clinical potential, in part as a justi- 
fication to obtain dwindling research funding, will 
raise expectations so high that even moderate 
clinical success in a few limited disease situations 
will be unable to fulfil the over-hyped promise 
associated with gene therapy. It is important to 
define realistic and obtainable goals which gene 
therapy might actually be able to achieve. These 
goals will only be sensible if there is a clear 
knowledge of the capabilities, and limitations, of the 
gene delivery vectors which are currently available 
and these should be well understood. 

We are currently in an exciting phase where the 
results of the first human clinical trials of gene 
therapy are beginning to be reported. The first 
priority is to ensure that the treatments adminis- 
tered to patients are safe and do not cause adverse 
reactions. It is unlikely that these early trials will 
show therapeutic effects, partly because of their 
inherent design and partly because it is generally 
end-stage patients who have been recruited. 
Provided no unforeseen toxicities are reported, the 
next decade should see gene therapies being 
administered to patients at earlier stages of disease, 
in circumstances where they may begin to have 
therapeutic effects. Eventually, it is to be hoped that, 
in certain well-designed clinical situations, gene 
therapy may emerge as effective adjuvant therapy 
for pre-existing treatment modalities and even, in 
some cases, as the treatment of choice in diseases as 
diverse as cystic fibrosis, cancer and HIV. 


28.1 Introduction 


The fruit fly Drosophila melanogaster has been at the 
forefront of genetic research since it was adopted as 
an experimental organism by T.H. Morgan in the 
early years of this century. Drosophila has many 
features that make it an eminently suitable organism 
for laboratory research. Drosophila research has 
produced excellent cytological maps of the larval 
salivary gland polytene chromosomes, which have 
served their purpose well in the genetic analysis of 
this organism. Nevertheless, in the modern era of 
molecular genetics, the availability of a molecular 
map has become essential. It is often mistakenly 
assumed that the primary purpose of a molecular 
map is to facilitate whole genome sequencing. This 
is not the case for D. melanogaster: there is a wealth of 
existing genetic information that will be tied in with 
the molecular map. This includes a great many 
chromosome rearrangements whose breakpoints 
are known at the level of resolution afforded by the 
polytene chromosome map, several thousand charac- 
terized and mapped genes, many of which have 
been cloned, and a large collection of transposon 
insertions. In this chapter I shall review the current 
status of genome mapping projects in Drosophila, 
in the context of its use as an experimental model 
organism. Because Drosophila has immense value as 
a model for a very wide variety of biological and 
biomedical studies, I shall discuss the features 
which make it such a powerful experimental model. 


28.2 Drosophila as a model system 


D. melanogaster is highly amenable to laboratory 
study. Its requirements for culture are extremely 
modest, and large-scale genetic experiments are 
easily carried out. The life cycle of Drosophila is 
typical of holometabolous insects. Embryonic development 
is rapid, with larvae hatching about 22 h 
after fertilization. After hatching, the larva grows 
through three larval instars before pupation. During 
the pupal stage, the animal metamorphoses into the 
adult. At 25 °C, the life cycle takes ~10 days. Several 
features of Drosophila development are important 
with respect to its use as a model organism. The 
maternally provided RNA and proteins fuel the 
embryo’s development through the early syncytial 
stages of development, and through to cellulariza- 
tion. Indeed, this maternal provision can exert an 
effect on the progeny’s development well into larval 
development (a phenomenon known as perdurance). 

A consequence of this division of the life cycle is 
that different mutant alleles of a locus may have 
different phenotypic effects and lethal phases. One 
example is the cell cycle gene polo [1], which is 
required for progression through mitosis. The stages 
at which cell division is required for viability in 
Drosophila are embryogenesis and metamorphosis, 
as most larval growth is by cell enlargement and 
endoreduplication of chromosomes during the three 
larval instars. Thus, weaker alleles of polo may yield 
homozygous adults, whose progeny die as embryos 
owing to insufficient maternal provision of the polo 
gene product (maternal-effect lethality), while nulls 
or strong hypomorphic alleles will not allow 
development of homozygotes through metamor- 
phosis, with a consequent late larval lethal period. 
Null homozygotes can develop this far because their 
maternal provision of polo gene product is sufficient 
to permit the embryonic mitoses in the absence of 
functional zygotic polo. Further examples can be 
seen in the analysis of genes required to set up the 
segmental body plan. Such genes have been iden- 
tified in screens of maternal-effect and zygotic 
lethals [2,3]. Some of the mutations isolated in this 
way have turned out to be alleles of viable mutants 
with a visible phenotype. 

The relevance of Drosophila to modern medical 
and biological research stems from the conservation 
of basic biological processes. Examples are numer- 
ous. The differentiation of photoreceptor seven in 
the compound eye, and of the terminal structures in 
the embryo, have been shown to be mediated by 
receptor tyrosine kinases and signal transduction 
pathways very similar to those in vertebrates. 
Moreover, the biological function of genes identified 
by recessive lethal mutations can be directly studied 
in vivo by using mitotic recombination to generate 
homozygous mutant clones of cells within a viable 
background. 

There are a number of books available which 
describe the biology of Drosophila. Perhaps the most 
useful in describing the general biology is that 
edited by Demerec [4]. Ashburner has published a 
single volume monograph dealing with all aspects 
of Drosophila genetics and biology [5], supplemented 
by a useful volume of methods [6]. An invaluable 
sourcebook on genetic loci and chromosome 
rearrangements, the ‘Red Book’, has recently been 
updated [7] and can be accessed electronically via 
FlyBase (see Section 28.3.5). 


28.2.1 Genetic mapping 


The first visible mutations of D. melanogaster, speck 
and white, were discovered in 1910 [8] and were 
rapidly followed by many others. A recombination 
map using six sex-linked mutants, the first in any 
organism, was conceived by Sturtevant and published 
in 1913 [9]. This map actually orders five loci, 
since two of the sex-linked factors turned out to be 
alleles of white. By 1925, the genetic map of D. 
melanogaster consisted of about 400 loci allocated to 
four linkage groups [10]. At that time, genetic 
research had also been carried out on some other 
Drosophila species, including D. simulans, a closely 
related sibling species [10]. Many mutants were 
found in D. simulans which had their homologous 
counterparts in D. melanogaster, as judged by pheno- 
typic and mapping analysis. This work revealed the 
existence of a large inversion on chromosome arm 
3R in D. simulans relative to D. melanogaster. 

At the time of writing, over 7300 mutants have 
been characterized and mapped (FlyBase, personal 
communication). Many of these loci have been 
cloned and subjected to detailed molecular analysis. 


28.2.2 Cytogenetics 
28.2.2.1 Mitotic chromosomes 


The genome of D. melanogaster comprises four pairs 
of chromosomes (Fig. 28.1). The sex chromosomes 
are heteromorphic: the X is subtelocentric, while the 
Y chromosome is entirely heterochromatic. In Droso- 
phila, the primary signal for sex determination is 
provided by the ratio of X chromosomes to sets of 
autosomes: if the ratio is 1:2, the fly is male; if it is 
1:1, the fly is female. Unlike mammalian systems, 
the Y chromosome is not male determining, but is 
required for male fertility, and XO individuals are 
fully viable, but sterile, males. The genetics of sex 
determination has been well characterized in 
Drosophila, revealing a cascade of genetic interac- 
tions which have illuminated many basic biological 
functions [11]. Again in contrast to mammalian 
systems, the process by which the level of expression 
of X-linked loci is adjusted to be the same in both 
males and females is not by inactivation of one of the 
X chromosomes in females, but by an increase in the 
transcriptional activity of the X chromosome in 
males (dosage compensation) [12]. 
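
The two cases described above can be expressed as a short Python sketch. It is a deliberate simplification: the function names are hypothetical, only the 1:2 and 1:1 ratios stated in the text are handled, and the presence of a Y chromosome is treated as sufficient for male fertility, whereas the text states only that it is required.

```python
from fractions import Fraction


def sex_from_ratio(x_count: int, autosome_sets: int = 2) -> str:
    """Classify sex from the ratio of X chromosomes to sets of autosomes."""
    ratio = Fraction(x_count, autosome_sets)
    if ratio == Fraction(1, 2):
        return "male"
    if ratio == 1:
        return "female"
    # Other ratios are not discussed in the text and are left unclassified.
    return "not covered by the two cases in the text"


def describe(x_count: int, y_count: int, autosome_sets: int = 2) -> str:
    """Add male fertility, which depends on the Y chromosome, to the sex call."""
    sex = sex_from_ratio(x_count, autosome_sets)
    if sex == "male":
        # The Y does not determine sex; it is needed only for male fertility.
        fertility = "fertile" if y_count >= 1 else "sterile"
        return f"male ({fertility})"
    return sex


if __name__ == "__main__":
    print("XY:", describe(1, 1))
    print("XO:", describe(1, 0))
    print("XX:", describe(2, 0))
```

Note that the Y chromosome count never enters the sex call itself, which is the contrast with mammalian systems drawn in the text.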

The two major autosomes, the second and third 
chromosomes, are metacentric chromosomes in 
metaphase preparations, while the tiny fourth 
chromosome is dot-like. Each chromosome can be 
subdivided into euchromatin and heterochromatin, 
the heterochromatin being principally located 
around the centromeres. The definitions of hetero- 
chromatin and euchromatin are essentially morpho- 
logical; heterochromatin remaining more condensed 
than euchromatin during the interphase of the cell 
cycle. The majority of the genetic loci are located in 
the euchromatic regions of the chromosomes. 

The heterochromatic regions of the chromosomes 
have been the subject of intensive chromosome 
mapping, using DNA-intercalating fluorochromes 
(see, for example, ref. 13), although this provides a 
much lower resolution than is possible with poly- 
tene chromosome mapping. Nevertheless, these 
techniques have been fundamental to the genetic 
analysis of the Y chromosome, which contains no 
loci essential for viability, but several required for 
male fertility. 


Euchromatin and heterochromatin. Heterochromatin is 
defined on the basis of its condensation behaviour 
during the cell cycle, generally remaining in a 
condensed state during interphase, although with 
the correlation of satellite DNA with hetero- 
chromatin, the distinction between satellite DNA 
and heterochromatin has become a little blurred. 
Heterochromatin in Drosophila can be divided into 
two classes, α- and β-heterochromatin. In polytene 
nuclei, the α-heterochromatin is entirely unpolytenized 
and appears as a small dot at the chromocentre, 
while β-heterochromatin is located at the 
bases of the chromosome arms at intermediate levels 
of polyteny, with a fuzzy, poorly banded appearance. 
Many transposable elements are known to have 
accumulated within the β-heterochromatin. 

There are few genes located within heterochro- 
matin, which in general appears to be in a transcrip- 
tionally inactive state. Chromosome rearrangements 
which bring euchromatic genes into close proximity 
to heterochromatic regions often display a pheno- 
menon known as position effect variegation [14]. For 
example, when the white gene is relocated near 
to heterochromatin, a variegated or patchy distribution 
of white+ activity can be seen in the ommatidia 
of the compound eye. The molecular explanation 
for this phenomenon is at present rather unclear, 
although many suppressors and enhancers of 
position effect variegation are known and have been 
characterized. These genes are implicated in the 
determination of chromatin structure. Interestingly, 
position effect variegation has been observed ‘in 
reverse’ for a heterochromatic gene, light [15], in 
which the expression of light is reduced when 
relocated by chromosome rearrangement to a 
euchromatic location. 


28.2.2.2 Polytene chromosomes and 
cytogenetic mapping 
Polytene chromosomes had been discovered in 
Chironomus in 1881 by Balbiani [16], though it was 
not until T.S. Painter published his work on the 
mapping of chromosomes in 1929 [17] that their 
importance to genetics was fully realized. 

Polytene chromosomes are rather curious structures 
produced by the continued replication of 
chromosomal DNA in the absence of mitosis, and 
persistent synapsis of the homologues in a state of 
condensation resembling interphase. In mature 
third instar larval salivary gland nuclei of Drosophila, 
the polytene chromosomes contain ~1000 tightly 
synapsed DNA molecules, yielding a characteristic 
and reproducible pattern of transverse bands. It is 
this banding pattern that is the real key to the utility 
of polytene chromosomes for mapping genetic loci 
and chromosome rearrangements. The polytene 
chromosome complement is shown in Fig. 28.1. The 
chromosome arms are joined at the chromocentre, 
which consists of the under-replicated pericentric 
heterochromatin. A diagrammatic representation of 
the structure of polytene chromosomes is shown in 
Fig. 28.2. 

Fig. 28.1 Polytene and mitotic chromosomes of 
Drosophila melanogaster. The five major chromosome arms 
can be seen extending from the chromocentre. The 
chromocentre contains the centromeres, and the 
unpolytenized pericentromeric heterochromatin, with (in 
males) the heterochromatic Y chromosome. Note the 
characteristic transverse banding pattern, which is the 
basis of Bridges’ map. The inset shows the mitotic 
complement of D. melanogaster at lower magnification. 
The two major autosomes, chromosomes 2 and 3, are 
clearly distinguishable from the other chromosomes of 
the complement. In favourable preparations, these 
chromosomes can be distinguished from each other by a 
secondary constriction of chromosome 2. 

It should be emphasized at this point that the 
polytene chromosomes represent only the euchro- 
matic fraction of the Drosophila genome: the hetero- 
chromatin, located close to the centromeres and 
throughout the Y chromosome, is under-replicated in 
polytene nuclei. Heterochromatin, while representing 
35% of the genome, is essentially absent from 
polytene chromosome maps, and indeed from their 
molecular derivatives. This is not generally considered 
a problem since the great majority of genetic 
loci are found in the euchromatin. 

The real breakthrough in the cytogenetic mapping 
of loci to the polytene chromosomes was the 
inspired mapping system devised by Bridges in 1935 
[18]. Bridges’ achievement was the adoption of a 
nomenclature system by which a particular band on 
the polytene chromosomes might be recognized 
virtually unambiguously by any other investigator. 
In so doing, he created what is effectively a usable 
physical map of the Drosophila genome capable of 
resolving loci as close as a few tens of kilobases in 
modern terminology. The importance of these 
chromosome maps for the course of genetics as a 
science cannot be overstated. 

In Bridges’ map, each major chromosome arm is 
divided into 20 sections, or divisions, and each 
division is subdivided into six subdivisions labelled 
A to F. In most cases, subdivisions begin with an 
easily recognized heavy band. This scheme allocates 
divisions 1-20 to the X chromosome, 21-40 and 
41-60 to the left and right arms, respectively, of the 
second chromosome, and 61-80 and 81-100 to the 
left and right arms, respectively, of the third 
chromosome. The minute fourth chromosome, 
which appears as a dot in metaphase spreads, was 
allocated divisions 101 and 102. The divisions at the 
bases of the chromosome arms (divisions 20, 40, 41, 
80, 81 and 101) have generally poorly defined band 
morphology, associated with the increasing quantities 
of β-heterochromatin found in these regions. 
Since its introduction 60 years ago, this map has 
been of central importance to genetic studies in D. 
melanogaster and its sibling species, and it has been 
used as a model for polytene chromosome maps in 
many other Drosophila species, other Diptera such as 
mosquitoes, and indeed for species of some of the 
few other insect orders that have polytene chromo- 
somes (such as Collembola, the springtails). 
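
The allocation of divisions to chromosome arms in Bridges’ scheme lends itself to a simple lookup. The Python sketch below is only an illustration of the numbering described above: the function names are hypothetical, and the example label 57B is arbitrary rather than the position of any particular locus.

```python
import re

# (arm, first division, last division), as allocated in Bridges' map.
ARM_RANGES = [
    ("X", 1, 20),
    ("2L", 21, 40),
    ("2R", 41, 60),
    ("3L", 61, 80),
    ("3R", 81, 100),
    ("4", 101, 102),
]

# Divisions at the bases of the arms, where band morphology is poorly defined.
BASAL_DIVISIONS = {20, 40, 41, 80, 81, 101}


def arm_for_division(division: int) -> str:
    """Return the chromosome arm that carries a given numbered division."""
    for arm, first, last in ARM_RANGES:
        if first <= division <= last:
            return arm
    raise ValueError(f"division {division} is outside the range 1-102")


def parse_position(label: str) -> tuple:
    """Split a label such as '57B' into (arm, division, subdivision)."""
    match = re.fullmatch(r"(\d{1,3})([A-F])", label.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized map position: {label!r}")
    division = int(match.group(1))
    return arm_for_division(division), division, match.group(2)


if __name__ == "__main__":
    print(parse_position("57B"))
    print(arm_for_division(101), 101 in BASAL_DIVISIONS)
```

Individual band numbers within a subdivision, introduced in the revised maps, are not handled by this sketch.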

Bridges’ maps have been extensively improved, 
most notably by a partnership between Bridges and 
his son, P.N. Bridges [19-23], in which individual 
bands within each subdivision were given identi- 
fying numbers, a real tour de force of optical micro- 
scopy. The Bridges’ revised map contains 5059 
bands, which are all uniquely identifiable. This is in 
very good agreement with the total found by Sorsa 
[24], during electron microscopic studies of the 
polytene chromosomes of D. melanogaster. There are 
of course minor disagreements between the band 
counts of these two maps, although Bridges’ map 
system can generally be relied upon to avoid 
confusion. A further set of maps, this time a 
photographic map aligned with Bridges’ 1935 map, 
was published in 1976 [25]. Taken together, these 
three maps are an essential research tool in the 
genetic analysis of D. melanogaster. 

Fig. 28.2 Diagrammatic representation of the polytene 
chromosomes of Drosophila melanogaster. (a) The polytene 
chromosome complement. The polytene chromosome 
arms consist of the paired homologues replicated 
1000-fold (in Drosophila larval salivary gland nuclei) and 
tightly synapsed in an interphase-like state. The central 
white circle represents the chromocentre in which 
heterochromatic regions are located. (b) Representations 
of individual chromosomes, showing heterochromatic 
regions as white blocks. Parts (a) and (b) are not to scale. 

An immense collection of chromosomal rear- 
rangements is available to the Drosophila investi- 
gator: inversions, transpositions, translocations, and 
deficiencies (Drosophila terminology for deletions), 
many of which can be combined to yield a 
sometimes bewildering array of possible segmental 
aneuploids. Using these rearrangements, mutations 
may be mapped to a particularly high precision, 
often to within a few tens of kilobases. 
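
The figure of a few tens of kilobases is consistent with a rough calculation from numbers quoted in this chapter: a haploid genome of 170 × 10⁶ bp, of which 35% is heterochromatin, and 5059 bands in the revised map. The Python sketch below assumes, purely for illustration, that the euchromatic DNA is spread evenly over the bands, which is certainly not true of individual bands.

```python
GENOME_BP = 170e6                # haploid genome size quoted in Section 28.2.3.1
HETEROCHROMATIN_FRACTION = 0.35  # fraction absent from polytene maps
BAND_COUNT = 5059                # bands in the revised Bridges map


def mean_bp_per_band(genome_bp: float = GENOME_BP,
                     heterochromatin_fraction: float = HETEROCHROMATIN_FRACTION,
                     band_count: int = BAND_COUNT) -> float:
    """Average euchromatic DNA per polytene band, assuming uniform bands."""
    euchromatin_bp = genome_bp * (1.0 - heterochromatin_fraction)
    return euchromatin_bp / band_count


if __name__ == "__main__":
    print(f"mean DNA per band: {mean_bp_per_band() / 1000:.1f} kb")
```

On these assumptions the average band corresponds to roughly 22 kb of DNA.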


28.2.2.3 Polytene chromosomes and 
evolutionary studies 
Polytene chromosomes have facilitated studies of 
the phylogeny of Drosophila species. For example, 
within the melanogaster species group the phylo- 
genetic relationship of the six species was worked 
out on the basis of fixed inversions of the polytene 
band sequence [26]. This is possible because the 
banding pattern of the chromosomes of these 
species, while extensively rearranged by inversions 
relative to one another, is essentially identical. 
Muller [27] proposed an alternative nomenclature 
for chromosome arms, in which the standard 
designations X, 2L, 2R, 3L, 3R and 4 of D. melano- 
gaster are replaced by elements A, B, C, D, E and F, 
respectively, reflecting evolutionarily conserved 
genetic elements [27,28]. This terminology was 
introduced since, for example, chromosome arm 2R 
of D. melanogaster may not correspond with the arm 
designated 2R in another species. Additionally, 
chromosome arms break and rejoin during evolution. 
For example, in D. virilis, each of Muller’s 
elements is present as a separate chromosome. It 
appears that these elements have been maintained 
as units throughout the evolution of the genus 
Drosophila, as judged by the analysis of homologous 
mutations, chromosome banding patterns (where 
possible), and in situ hybridization. Indeed it has 
been suggested that Muller’s elements are con- 
served in flies as distantly related to Drosophila as the 
blowfly Lucilia [29] and mosquitoes. As discussed in 
Section 28.3.6, there are many applications of the D. 
melanogaster physical map to the study of evolution- 
ary relationships in the genus Drosophila, and further 
afield within the order Diptera. 


28.2.3 Molecular genetics of Drosophila 


Drosophila is in the vanguard of molecular genetic 
investigation of eukaryotic systems, and many 
important strategies and techniques were devel- 
oped or perfected for use in Drosophila, such as in 
situ hybridization, genomic libraries and chromo- 
some walking, to name but three. Consequently, it 
has remained the organism of choice for many 
researchers. 


28.2.3.1 The structure of the Drosophila genome 

The D. melanogaster haploid genome size is 0.18 pg, 
which corresponds to 170 × 10⁶ bp [30]. In contrast to 
mammalian genomes, the DNA is unmethylated 
[31]. Table 28.1 describes the basic composition of 
the D. melanogaster genome. 


Satellite DNA Twenty-one per cent of the haploid 
genome consists of satellite DNA of a number of 
families. Satellite DNA in Drosophila is principally 
located in the pericentromeric heterochromatin, and 
in the Y chromosome. In general, the satellite DNA 
consists of simple-sequence repeats arranged in 
large blocks, although one class, the 1.688 g ml⁻¹ 
satellite located on the X chromosome, has a repeat 
unit of 359 bp [32]. There are several other classes 
of satellite DNA repeats, some of which display 
characteristic distribution patterns among the 
chromosomes. 


Ribosomal RNA genes D. melanogaster rDNA is 
located in the nucleolar organizers, on the X and Y 
chromosomes. These include the genes for the 18S 
and 28S rRNAs [33]. bobbed (bb) mutations affect 
the rDNA arrays [34]. 
The bb phenotype is seen when the number of 
functional rDNA copies is reduced to 50% of normal 
wild-type levels. X-chromosomal bb alleles are 
complemented by the presence of rDNA arrays on 
the Y chromosome; Ybb⁻ chromosomes are also known. 


Table 28.1 The composition of the D. melanogaster genome. 

Category of DNA                   % of genome   Size (bp) 
Total genome                                    170 × 10⁶ 
Single-copy sequences             61            103.7 × 10⁶ 
Satellite DNA                     21            35.7 × 10⁶ 
Genes for rDNA, histones, etc.    3             5.1 × 10⁶ 
Transposable elements             9             15.3 × 10⁶ 
Foldback DNA                      6             10.2 × 10⁶ 


Histone genes There are ~100 repeats of the histone 
genes per haploid genome [35], located on chro- 
mosome arm 2L, at map location 39D3-39E1.2 [36]. 
In addition, a number of histone gene variants 
appear to be present elsewhere in the genome [36]. 


Foldback DNA Because foldback DNA is implicated 
as transposable DNA, and in the function of the 
family of large transposable elements, the TEs, it is 
described in the section on transposable elements 
below [37-42]. 


Transposable elements Approximately 9% of the D. 
melanogaster genome is composed of middle repetitive 
transposable elements, which are dispersed on 
average every 13 kb throughout the euchromatin. 
The families of Drosophila transposable elements and 
related elements are reviewed in refs 43 and 44. 
Some 50 families of transposable element have been 
identified in D. melanogaster, and they are typically 
present in copy numbers in the range 10-100. The 
most significant element with respect to the mole- 
cular genetic analysis of Drosophila has been the 
P element, and this is covered in greater detail in 
Section 28.2.3.2. Limitations of space prevent the 
listing of all transposable elements found in D. 
melanogaster. A reasonably complete listing can be 
found in ref. 5. Several of these elements are 
associated with hybrid dysgenesis. There is strong 
evidence for horizontal transmission of some trans- 
posable elements between species [45]. Drosophila 
simulans, a sibling species of D. melanogaster, appears 
to have a lower proportion of dispersed repetitive 
DNA, and different populations of transposable 
elements [46-48]. This difference can be exploited 
experimentally, as has been done, for example, by 
the European Consortium genome mapping project 
described in Section 28.3.2 below. 


Copia-like elements Copia elements possess direct 
long terminal repeats (LTRs) of several hundred 
base pairs, and within these LTRs there are short 
terminal inverted repeats. The structure of these 
elements is similar to that of retroviruses. Other 
members of this class of transposable element 
include gypsy, 297, 17.6, mdg-1 and 412. Many 
Drosophila mutations are due to the insertion of copia 
or other members of this class of element. 


Long-terminal inverted repeat elements A substantial 
proportion of the genome is composed of rapidly 
reannealing foldback DNA; it has been estimated 
that there are about 2000-4000 pairs of inverted 
repeats in the D. melanogaster genome. These 
structures have been shown to be transposable [40], 
and they are quite variable both in terms of the 
length and sequence of the repeats (which are 
themselves internally repetitious), and the length 
and sequence of the region between the repeats. The 
very large TE elements of Ising [49] are derived from 
foldback (FB) elements, and typically contain a 
section derived from the X chromosome spanning 
the white to roughest interval, sufficiently large to be 
seen cytologically in the polytene chromosomes in 
some cases. 


Transposable elements with short inverted terminal 
repeats The most notable member of this class is the 
P element, which will be described in greater detail 
below. Another element in this class, hobo, is also 
implicated in a hybrid dysgenesis syndrome, and 
has been used as an insertional mutagen and as a 
germline transformation system in a similar way to 
the P element [50]. The activity of both P and hobo 
elements is associated with high levels of chromosome 
rearrangements [51]. 


Transposable elements without terminal repeats The I 
factor is the causative agent of IR hybrid dysgenesis 
in D. melanogaster. I factors are related to 
mammalian LINE elements and to Drosophila F 
elements [52]. These elements appear to transpose 
via an RNA intermediate. 


28.2.3.2 The P element and its uses 

The most important transposable element from the 
point of view of Drosophila genetics is the P element. 
This element is the cause of P-M hybrid dysgenesis, 
a syndrome of multiple effects such as male 
recombination (which does not normally occur in 
Drosophila), high frequency of chromosome rear- 
rangement, sterility, and high mutation rates [53]. 
The dysgenic effects are seen in the descendants of 
crosses of P strain males with M strain females, but 
not vice versa. P strains contain P elements and have 
the P cytotype, whereas M strains lack P elements, 
and have the M cytotype. The dysgenic effects 
appear to be due to the transposition of P elements 
within the genome of the P strain male, as it enters 
the permissive environment of M cytotype eggs at 
fertilization. The full-length P element is ~2.9 kb, 
and encodes a single transcription unit of four 
exons, which are differentially spliced to encode 
either the transposase or a repressor of transposition 
[54]. Additionally, factors encoded in the host 
genome are required for P element transposition 
[55]. The elevated mutation rates typically seen in 
hybrid dysgenesis are due to the insertion of P 
elements within or in close proximity to transcrip- 
tion units. 


Transposon tagging The mutagenic effect of P- 
element insertion has been of use in gene cloning 
strategies. Because the mutations due to P tran- 
sposition generally retain a copy of the element 
either within or near the coding region, these mutant 
genes can be cloned by virtue of the inserted element 
[56]. This procedure is known as transposon tag- 
ging, and is of course not restricted to P elements. 
Transposon mutagenesis has been refined so that 
stable single insertions of P elements are recovered 
[57]. Collections of Drosophila stocks each bearing a 
single marked P insertion associated with a lethal or 
visible mutant phenotype are important com- 
ponents of the European and US Drosophila genome 
projects (see Sections 28.3.2.6 and 28.3.3.2). 


P element-mediated germline transformation In the 
early 1980s, Rubin and Spradling developed a 
system by which the P element could be used to 
introduce DNA into the germline at high frequencies 
[58,59]. Essentially, P element-mediated germline 
transformation utilizes two components. The first 
component is the vector, a P element modified so 
it does not encode transposase, into which the 
DNA under investigation is inserted, along with a 
selectable marker. The second component is a helper 
P element, which encodes functional transposase 
but which cannot undergo transposition itself. The 
resulting transformant flies are genetically stable, 
since there is no active transposase present. Germ- 
line transformation can prove that a particular gene 
has been cloned, by rescue of a mutant phenotype 
with a candidate clone. 

Germline transformation has enabled the devel- 
opment of a wide variety of additional techniques 
with applications in genetics, developmental and 
cell biology. Enhancer trapping was initially 
developed using a single element containing a 
promoterless bacterial lacZ gene to detect the 
transcriptional activity of neighbouring enhancer 
elements [60]; it is now available as two-component 
systems. In these systems, one element expresses 
the yeast transcription factor Gal4 in a temporal 
and spatial pattern determined by a neighbouring 
enhancer. The Gal4-containing element is then used 
to drive a reporter gene, borne on a second element, 
via an upstream activating sequence from the Gal4 
promoter (UAS_GAL4), placed upstream of the reporter 
gene [61]. This is a highly versatile system: any Gal4 
element insertion can be used with any responding 
gene cloned downstream of a UAS element. The 
responding gene can be a gene under experimental 
investigation, a reporter gene such as the bacterial 
chloramphenicol acetyl transferase (CAT) gene 
or lacZ [61], or a toxin used for cell ablation in a 
pattern corresponding to the Gal4 expression pattern 
[62]. 


P elements as mutagens Drosophila lacks a system 
analogous to homologous recombination in yeast 
which can be used to mutate a specific gene. In cases 
where genes have been identified solely from 
cloning experiments, corresponding mutants may 
not be available. P elements have been utilized in a 
sib-selection mutagenesis strategy [63,64], where 
potentially mutagenic insertions at a locus of 
interest are selected by PCR amplification of eggs 
laid by mutagenized females. A series of tenfold 
divisions of the pool of mutagenized flies can be 
screened, ultimately identifying single flies containing 
a particular insertion. Of course, not all insertions 
will have a mutagenic effect. 
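
The pooling logic of sib-selection can be sketched as follows. This is a minimal illustration rather than the published protocol: `pool_is_positive` is a hypothetical stand-in for the PCR assay on eggs from a pool, and the fly identifiers and the single carrier are invented for the example. 

```python
from typing import Callable, List, Sequence


def sib_select(flies: Sequence[str],
               pool_is_positive: Callable[[Sequence[str]], bool],
               fold: int = 10) -> List[str]:
    """Narrow a pool down to the individual flies that give a positive signal.

    Each round splits the current pool into at most `fold` sub-pools and
    follows only those that test positive, mirroring the tenfold divisions
    described in the text.
    """
    if not pool_is_positive(flies):
        return []
    if len(flies) == 1:
        return list(flies)
    size = -(-len(flies) // fold)  # ceiling division: flies per sub-pool
    hits: List[str] = []
    for start in range(0, len(flies), size):
        hits.extend(sib_select(flies[start:start + size], pool_is_positive, fold))
    return hits


# Hypothetical example: 1000 mutagenized females, one carrying the insertion.
flies = [f"fly{i:04d}" for i in range(1000)]
carriers = {"fly0427"}


def assay(pool: Sequence[str]) -> bool:
    """Stand-in for the PCR test: positive if any fly in the pool is a carrier."""
    return any(fly in carriers for fly in pool)


found = sib_select(flies, assay)
print(found)
```

With tenfold divisions, a pool of 1000 flies is resolved to single individuals in three rounds of subdivision (1000 to 100 to 10 to 1). 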


28.3 Mapping projects 


The Drosophila genome has been mapped at several 
levels of resolution, reflecting the insert size in yeast 
artificial chromosome (YAC), P1 phage, or cosmid 
vectors. Phage λ vectors have too small a cloning 
capacity to be of use in large-scale mapping endeav- 
ours in Drosophila. There has been cooperation 
between the genome mapping groups in the 
exchange of materials and information, but in 
practice they have been run independently. I will 
describe these projects in a loose chronological 
order. The first to be discussed are those that have 
reached an acceptable degree of conclusion: the 
YAC-based maps of D. Hartl and colleagues, in 
which clones were ordered by in situ hybridization 
to polytene chromosomes. In situ hybridization of 
randomly selected clones has also been used to build 
a framework map of the D. melanogaster genome in 
P1 clones, with the same library used in the 
Drosophila Genome Center mapping project (see 
Section 28.3.3). A second project, which also makes 
extensive use of the polytene chromosomes, is a 
collaborative project funded by the European 
Community (EC) to map the genome in cosmids. 
While conventional fingerprinting techniques are 
used, chromosome microdissection (see ref. 65 and 
Chapter 11) is used to subdivide the genome prior to 
mapping. Finally, a comprehensive effort to tie 
together a wealth of genetic and molecular infor- 
mation to a P1 map is being conducted by a large 
collaborative venture involving the laboratories 
of Rubin, Spradling, Hartl and Palazzolo, which 
together form the Drosophila Genome Center. 

All these mapping endeavours are being carried 
out concurrently, and will provide a multilayered 
map which will be of immense utility. It should be 
pointed out that all of the above projects are dealing 
with the euchromatic portion of the genome. The 
heterochromatin, containing long arrays of simple 
repeat satellite DNA, is not easily amenable to 
physical mapping at this time. 


28.3.1 The Drosophila YAC maps 


Two YAC-based mapping projects have been carried 
out, principally in the laboratories of D.L. Hartl 
(Harvard, formerly at Washington University), and 
I. Duncan (Washington University). The procedures 
involved have been directly transferred to the 
analysis of P1 genomic libraries of D. melanogaster 
(Section 28.3.3) and D. virilis (Section 28.3.6.1). 


28.3.1.1 Strategy 

The strategy adopted by Hartl, Duncan and 
colleagues was that of mapping YAC clones directly 
to the polytene chromosomes by in situ hybridiza- 
tion. This is an approach appealing in its simplicity 
and straightforwardness. YACs were chosen in 
favour of other vector systems principally for the 
large size of their cloned DNA segments, with the 
consequence that a correspondingly small number 
of in situ hybridizations need be carried out. The 
band count of the revised polytene chromosome 
map of Bridges and Bridges is 5059 bands, and 
since the euchromatic fraction of the genome which 
this represents is 65-70% of the total genome of 
165 × 10⁶ bp, or 115 × 10⁶ bp, the average band 
contains 22 kb of DNA (the interbands contain very 
little DNA). One therefore expects to find that a 
typical YAC clone containing an insert of 200-220 kb 
would span about 10 bands, and this is indeed what 
was seen [66]. 
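
The arithmetic behind these estimates can be checked directly. The short sketch below uses only the figures quoted in this paragraph; the variable names are illustrative. 

```python
GENOME_BP = 165e6        # total genome size quoted in the text
EUCHROMATIC_BP = 115e6   # euchromatic portion quoted in the text
BANDS = 5059             # band count of the revised Bridges and Bridges map

euchromatic_fraction = EUCHROMATIC_BP / GENOME_BP
kb_per_band = EUCHROMATIC_BP / BANDS / 1e3

# Bands expected to be spanned by YAC inserts of 200 and 220 kb.
bands_per_yac = [insert_kb / kb_per_band for insert_kb in (200, 220)]

print(f"euchromatic fraction: {euchromatic_fraction:.2f}")
print(f"average DNA per band: {kb_per_band:.1f} kb")
print("bands per YAC:", [round(b, 1) for b in bands_per_yac])
```

The result, between nine and ten bands per clone, agrees with the figure of about 10 bands given above. 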


There are drawbacks to this mapping strategy, 
relating to the interpretation of hybridization signals 
on polytene chromosomes: the experimenter’s inter- 
pretation of an in situ hybridization can be incorrect. 
This can be controlled relatively easily by including 
duplicate clones, and comparing the in situ readings. 
Cai et al. [67] describe the comparison of 38 YACs 
mapped by Ajioka et al. [68], and conclude that while 
the reported localizations are broadly correct, the 
precise end points defined by the two groups differ. 
In practice these differences are likely to be due to 
technical limitations on the resolution of in situ 
signals and chromosome bands. These include vari- 
able signal strength developed on the chromosomes, 
and variable chromosome morphology. The former 
can result in an overestimation or underestimation 
of the number of bands covered by a signal, and the 
latter can prevent the accurate visualization of 
chromosome bands. 


28.3.1.2 Libraries 

Three YAC libraries were constructed, in the 
laboratories of D.L. Hartl and I.W. Duncan (Table 
28.2) [66,67]. The first library consists of 768 clones, 
derived from randomly sheared DNA from the 
wild-type strain Oregon RC. The genomic DNA was 
prepared from embryos and was size selected for 
DNA fragments larger than 120 kb. The fragments 
were cloned into the vector pYACP-1. This vector 
has a number of features associated with its use as a 
Drosophila vector, including terminal fragments of 
the transposable P element that flank both the 
cloned DNA and a bacterial G418-resistance gene 
driven by the Drosophila hsp70 promoter. In prin- 
ciple, these features should enable clones from this 
library to be reintroduced to the genome by P 
element-mediated germline transformation, though 
Garza et al. [69] do not report attempts to achieve 
this and, to date, successful germline transforma- 
tion using YAC clones in this vector has not been 
reported. 

The second library was derived from DNA 
partially digested with NotI. Cells from Oregon RC 
gastrulae were embedded in agarose plugs, and the 
partial digestion with NotI was carried out in these 
plugs. The genomic DNA was cloned in the vector 
pYAC5, resulting in 2688 clones. 

Table 28.2 Drosophila YAC libraries. 

                   Library 1      Library 2              Library 3 
DNA source         Oregon RC      Oregon RC              y; cn bw sp 
DNA preparation    random shear   NotI partial digest    EcoRI partial digest 
Size selection     > 120 kb       -                      150-550 kb 
Vector             pYACP-1        pYAC5                  pYAC4 
No. of clones      768            2688                   4032 
No. analysed       272            502                    419 
The third library was prepared with DNA 
extracted from an isogenic y; cn bw sp stock. The 
DNA was partially digested with EcoRI, and frag- 
ments of 150-550 kb selected for library construc- 
tion. The vector used was pYAC4, and 4032 clones 
were obtained. 

It was considered important that questions of the 
integrity and stability of YAC inserts be addressed 
for each library used. Repetitive sequences in 
particular might be expected to be susceptible to 
postcloning rearrangements. In the case of rRNA 
genes, about 50 YAC clones containing rDNA 
repeats were selected and analysed, indicating that 
there is no obstacle to the cloning of these regions in 
the YAC systems used. These clones hybridize in situ 
to the chromocentre of the polytene chromosome 
complement, as expected. However, some instabil- 
ity of clones was observed by comparing restriction 
patterns of different isolates of identical clones. 

Long stretches of simple-sequence satellite DNA 
were not recovered from the YAC libraries examin- 
ed. In the Drosophila genome, these sequences are 
located in heterochromatin, which is principally 
found near the centromeres of all chromosome arms, 
and throughout the Y chromosome. At these loci, 
satellite DNA is arranged in long blocks of repeated 
units. However, it is not immediately obvious which 
stage of the cloning process discriminates against 
the satellite sequences. Hartl [66] discusses a 
number of possibilities, including the hypothesis of 
Lohe and Brutlag [70] that this DNA is lost at the 
stage of DNA purification, due to the nature of 
DNA-chromatin protein interaction in heterochro- 
matin. 

The representation of single-copy sequences was 
assessed by hybridization of cloned genes to the 
library. In no case was there a failure to identify at 
least one clone in the library. In addition, examina- 
tion of the organization of the inserts in comparison 
to the previously characterized clones revealed no 
detectable rearrangement during the YAC cloning. 
A further point is that since the libraries were made 
from DNA extracted from embryos of mixed sex, 
there is an expectation that there would be reduced 
representation of X-chromosomal regions relative to 
autosomal regions (assuming equal proportions of 
male and female embryos, there will be only 
three X chromosomes and one Y chromosome for 
each four sets of autosomes). 
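
The expected representation of the sex chromosomes relative to the autosomes can be worked out for any sex ratio. The function below is a small sketch of that calculation; it assumes only the standard karyotypes (XX females, XY males, two sets of autosomes in each sex), and the function name is illustrative. 

```python
def relative_representation(male_fraction: float = 0.5) -> dict:
    """Copies of X and Y per autosome set in DNA pooled from mixed-sex embryos."""
    female_fraction = 1.0 - male_fraction
    x_copies = 1 * male_fraction + 2 * female_fraction  # XY males, XX females
    y_copies = 1 * male_fraction
    autosome_sets = 2.0                                 # diploid in both sexes
    return {"X": x_copies / autosome_sets, "Y": y_copies / autosome_sets}


equal_sexes = relative_representation(0.5)
print(equal_sexes)
```

With equal numbers of males and females, X-chromosomal sequences are present at three-quarters, and Y-chromosomal sequences at one-quarter, of the autosomal level. 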


A very similar approach is being taken with a 
library of P1 clones, which are being ordered by in 
situ hybridization to polytene chromosomes [71,72]. 
This P1 library is the library used by the Drosophila 
Genome Center for a large-scale mapping project, 
and will be described in Section 28.3.3. 


28.3.1.3 Final status of the YAC maps 

Estimation of the degree of YAC clone coverage of 
the genome is made by calculating the proportion 
of chromosome bands covered by YAC clones, as 
judged by in situ hybridization. These calculations 
are always approximate, due to technical consider- 
ations discussed in Section 28.3.1.1. 

Without the resolution afforded by the polytene 
chromosomes, the linear relationships between YAC 
contigs and unattached YAC clones would be 
impossible to determine. However, since in situ 
hybridization to polytene chromosomes is the 
means by which these contigs were assembled, even 
those clones that remain unattached are a usable 
component of the final map. This mapping strategy 
results in a map in which clone overlaps are not 
molecularly characterized. 


Hartl map The final YAC map is estimated to 
represent 90% of the euchromatic genome [66,73], as 
all but 550 of the 5157 bands of the Sorsa EM 
polytene chromosome map are covered by YAC 
clones in in situ hybridization experiments. It 
consists of 1193 YAC clones in 149 contigs. The 
estimate of the number of contigs is conservative: 
the number may well be lower, since YACs with in 
situ signal overlaps shorter than two bands are not 
scored as overlapping. The average insert size of the 
YAC clones is 200 kb. 
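
The summary figures for the Hartl map can be cross-checked with a few lines of arithmetic. The sketch below uses the numbers quoted in this paragraph, together with the euchromatic genome size of 115 × 10⁶ bp from Section 28.3.1.1; the redundancy estimate is an illustrative calculation, not a value reported by the mapping groups. 

```python
TOTAL_BANDS = 5157        # Sorsa EM polytene map
UNCOVERED_BANDS = 550
YAC_CLONES = 1193
CONTIGS = 149
MEAN_INSERT_KB = 200
EUCHROMATIN_KB = 115_000  # from Section 28.3.1.1

band_coverage = (TOTAL_BANDS - UNCOVERED_BANDS) / TOTAL_BANDS
clones_per_contig = YAC_CLONES / CONTIGS
# Total cloned DNA relative to the euchromatic genome (clone redundancy).
redundancy = YAC_CLONES * MEAN_INSERT_KB / EUCHROMATIN_KB

print(f"band coverage:     {band_coverage:.1%}")
print(f"clones per contig: {clones_per_contig:.1f}")
print(f"redundancy:        {redundancy:.1f}x")
```

The band coverage of just over 89% corresponds to the rounded figure of 90% quoted above. 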


Duncan map Cai et al. [67] estimate their map covers 
about 76% of the autosomal euchromatin, and 63% 
of the X chromosome euchromatin. The under- 
representation of the X chromosome in these 
libraries is expected, as described above. In addition, 
sequences derived from the fourth chromosome are 
under-represented, for reasons that are at present 
unclear. 


28.3.1.4 Availability of clones 

The complete list of mapped YAC clones is to be 
found in FlyBase (see Section 28.3.5). The listing 
contains clone names with the cytological map 
location on polytene chromosomes, and information 
on how to obtain clones. 


28.3.2 European Consortium cosmid map 


28.3.2.1 Strategy 

The extensive use of polytene chromosomes is a 
feature of the cosmid-based physical map being 
constructed by a consortium of European laboratories, 
headed by F.C. Kafatos (EMBL, Heidelberg) 
[74]. The laboratories involved are those of C. 
Savakis and C. Louis (IMBB, Crete), M. Ashburner 
(Cambridge, UK), D.M. Glover and R.D.C. Saunders 
(Dundee, UK), and J. Modolell (Madrid, Spain). The 
strategy adopted is illustrated in Fig. 28.3. This 
approach uses fingerprinting techniques, as devel- 
oped for the Caenorhabditis elegans cosmid-based 
mapping project (see Chapter 29) [65]. However, 
unlike the C. elegans methodology, cosmids for 
fingerprinting are not selected at random from the 
genomic library. Rather, the genome is separated 
into about 100 similarly sized segments, corre- 
sponding to each of the Bridges’ map divisions, by 
chromosome microdissection and PCR amplifi- 
cation. DNA amplified in this way is used to screen 
the master cosmid library, in order to identify 
division-specific minilibraries of clones. Each mini- 
library is treated as a separate small genome of 
1.2 Mb for the purpose of assembling contigs of 
cosmids. This has a beneficial effect, as it permits the 
use of reduced stringency in the computer matching 
of clones, with no consequent increase in spurious 
overlap detection. Additionally, the map for each 
division approaches completion earlier than if the 
whole library were to be fingerprinted simulta- 
neously. 
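
A rough calculation suggests why subdividing the library helps. The sketch below assumes the 115 × 10⁶ bp euchromatic genome quoted in Section 28.3.1.1 and the 19 200-clone library of Section 28.3.2.2, and makes the simplifying assumption that clones are spread evenly over 100 divisions. It is an illustration only, not part of the consortium's own analysis. 

```python
EUCHROMATIN_BP = 115e6    # from Section 28.3.1.1
DIVISIONS = 100           # approximate number of Bridges' map divisions
LIBRARY_CLONES = 19_200   # from Section 28.3.2.2


def pairwise(n: int) -> int:
    """Number of distinct clone pairs that must be compared among n clones."""
    return n * (n - 1) // 2


division_mb = EUCHROMATIN_BP / DIVISIONS / 1e6
clones_per_division = LIBRARY_CLONES // DIVISIONS  # assumes an even spread

whole_library_pairs = pairwise(LIBRARY_CLONES)
minilibrary_pairs = DIVISIONS * pairwise(clones_per_division)
ratio = whole_library_pairs / minilibrary_pairs

print(f"DNA per division:        {division_mb:.2f} Mb")
print(f"comparisons, whole:      {whole_library_pairs}")
print(f"comparisons, subdivided: {minilibrary_pairs}")
print(f"reduction:               {ratio:.0f}-fold")
```

Under these assumptions, each fingerprint is compared against roughly 100-fold fewer candidates, so a looser matching criterion yields no more chance matches than a strict criterion applied to the whole library. 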

Fig. 28.3 Flow chart of the European Consortium 
genome mapping project. 

All contigs are checked by in situ hybridization 
of some member clones to polytene chromosomes, 
and unattached cosmids are also mapped in this 
way. Any clones derived from unexpected locations 
are reassigned to their correct divisional map. 
Sequence-tagged sites (STSs) are determined from 
selected cosmids, which together with hybridization 
of cloned genes to the cosmid library permits 
alignment of the physical and genetic maps. A flow 
chart illustrating the strategy is presented in 
Fig. 28.3. 


28.3.2.2 Library construction 

The cosmid library was constructed in the cosmid 
vector Lorist 6 [75], using DNA extracted from the 
wild-type strain Oregon R. DNA was extracted from 
freshly eclosed adults, and was partially digested 
with Sau3A, then ligated into the BamHI site of Lorist 
6. A total of 19 200 independent colonies were 
transferred into wells of 200 microtitre plates for 
long-term storage. For library 
screening, the clones were gridded manually on 25 
filters of 768 clones (eight microtitre plates per filter). 
Subsequently, 192 of the microtitre plates were 
picked using a robotic device built and operated by 
H. Lehrach’s laboratory at the ICRF in London. 
Robotically picked filters consisted of either two 
22 cm² filters bearing 96 × 96 colonies each, or one 
22 cm² filter bearing 192 × 96 colonies, in a regular 
array. 
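
The plate and filter counts quoted above are mutually consistent, as a quick check shows. The only assumption is the standard 96-well microtitre plate, which the text does not state explicitly. 

```python
WELLS_PER_PLATE = 96  # assumed standard microtitre plate format

total_clones = 19_200
plates = total_clones // WELLS_PER_PLATE

# Manual gridding: 25 filters, eight plates per filter.
clones_per_manual_filter = 8 * WELLS_PER_PLATE
manual_total = 25 * clones_per_manual_filter

# Robotic gridding of 192 plates: two 96 x 96 filters or one 192 x 96 filter.
robot_clones = 192 * WELLS_PER_PLATE
two_filter_layout = 2 * (96 * 96)
one_filter_layout = 192 * 96

print(plates, clones_per_manual_filter, manual_total)
print(robot_clones, two_filter_layout, one_filter_layout)
```

Both robotic layouts hold exactly the 18 432 clones contained in 192 plates. 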


28.3.2.3 Polytene chromosome microdissection 
In the early stages of this project, conventional 
microcloning was used to prepare region-specific 
probes, a procedure in which DNA microdissected 
from polytene chromosomes is digested with 
restriction enzyme then cloned in a λ insertion 
vector [76]. However, this approach was not 
technically straightforward, and it proved impos- 
sible to routinely generate usable probes. Only one 
division was mapped using a microcloned probe. 
All subsequent division-specific probes were de- 
rived from PCR amplification of microdissected 
DNA (see Chapter 11) [77,78]. This procedure 
produces probes of much higher complexity than 
does microcloning. Microdissected DNA is cleaved 
to completion with Sau3A, and double-stranded 
adapter oligonucleotides are ligated to the cohesive 
termini, to provide priming sites for PCR amplification. 
Because polyteny in Drosophila chromosomes 
represents an initial amplification of 1000-fold, 
a single microdissection provides enough material 
to generate a representative probe. 

A major problem in this approach was rapidly 
identified. The presence of dispersed middle 
repetitive DNA in the genome results in a high 
frequency of misidentification of clones using these 
probes. For example, a microdissection of division 1 
will identify most or all of the cosmids containing 
inserts derived from that division. However, should 
there have been (for example) a copia element 
present in the initial microdissection, a number of 
clones will be identified solely because they contain 
further copies of copia. As described in Section 
28.2.3.1, there is evidence that the genome of D. 
simulans contains a lower proportion of dispersed 
middle repetitive DNA [46-48]. Furthermore, only a 
proportion of that present in D. simulans is also 
present in D. melanogaster. For these reasons, the 
microdissections were carried out on the polytene 
chromosomes of D. simulans. These polytene chro- 
mosomes are essentially identical to those of D. 
melanogaster, differing principally by a large inver- 
sion in chromosome arm 3R [27]. 

It has been found empirically that this strategy 
alleviates the problem of transposable elements in 
microdissected probes, but that it does not eliminate 
it entirely. A proportion of clones in each screen are 
thus derived from an essentially random genomic 
location, though these clones are not lost from the 
analysis. The question of whether the nucleotide 
sequence conservation between the two species is 
sufficient for successful use of this strategy was 
addressed empirically [76], by probing a Southern 
blot of λ-phage clones from a walk in the achaete-scute 
region with an appropriate microdissected 
probe. Virtually all of the restriction fragments 
hybridized with the probe. It has been estimated 
that the sequence divergence between the majority 
of the single-copy sequences in these two species is 
only of the order of 3-4%. The probes prepared by 
microdissection consist of a pool of small restriction 
fragments, used to identify segments of DNA some 
100 times their own size. In practice there appear 
to be no problems due to nucleotide sequence 
divergence. 


28.3.2.4 Fingerprinting and contig assembly 

Cosmids are fingerprinted and analysed essentially 
as described for the C. elegans genome mapping 
project [79] with the modification that the enzyme 
used for generating the fingerprints is HinfI. 
Cosmids corresponding to each division are 
analysed together, and kept initially in an individual 
division-specific database. 
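
The principle of fingerprint-based contig assembly can be illustrated with a toy example. The sketch below is a deliberate simplification: it declares two clones overlapping when they share a fixed minimum number of bands within a tolerance, whereas real fingerprinting software scores matches statistically. The function names, the threshold, the tolerance and the band positions are all invented for illustration. 

```python
from itertools import combinations


def shared_bands(a, b, tol=1.0):
    """Count bands in fingerprint `a` that have a partner in `b` within `tol`."""
    remaining = sorted(b)
    count = 0
    for band in sorted(a):
        for i, other in enumerate(remaining):
            if abs(band - other) <= tol:
                count += 1
                del remaining[i]  # each band may be matched only once
                break
    return count


def assemble_contigs(fingerprints, min_shared=4, tol=1.0):
    """Group clones into contigs by single linkage on pairwise overlaps."""
    parent = {name: name for name in fingerprints}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(fingerprints, 2):
        if shared_bands(fingerprints[a], fingerprints[b], tol) >= min_shared:
            parent[find(a)] = find(b)

    contigs = {}
    for name in fingerprints:
        contigs.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in contigs.values())


# Invented band positions for four hypothetical cosmids from one division.
division = {
    "cosA": [110, 150, 203, 260, 310, 377],
    "cosB": [203, 260, 310, 377, 420, 468],
    "cosC": [310, 377, 420, 468, 515, 590],
    "cosD": [125, 180, 240, 333, 402, 555],
}
contigs = assemble_contigs(division)
print(contigs)
```

Here cosA and cosC share too few bands to be joined directly, but both overlap cosB and so fall into one contig, while cosD remains unattached. 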

Following contig assembly, a number of cosmids 
are selected from each contig and analysed by in situ 
hybridization to polytene chromosomes to verify 
their location on the cytogenetic map. Not all 
cosmids can be assigned to a single locus, however. 
In some cases a cosmid has, in addition to a primary 


site, a number of secondary sites with weaker 
signals, which probably correspond to dispersed 
transposable elements. In these cases, the primary 
site can still be determined. In other clones, a large 
number of sites of equal intensity are found, making 
it impossible to verify the cytogenetic location of the 
clone. Hybridization of these clones to D. simulans 
chromosomes can in many cases resolve the location 
of the clone, because of the differences between the 
population of repetitive DNA in D. simulans and D. 
melanogaster. 

The cosmid-based physical map has been aligned 
with the recombination map in three ways. First, 
members of the research community have made 
existing cloned genes available for hybridization 
studies, and these clones have been used to screen 
the master cosmid library. In some cases contigs 
could be linked, or new cosmids mapped by this 
procedure. A simpler approach has been to syn- 
thesize oligonucleotides corresponding to genes 
with entries in the sequence databases, and to use 
these as hybridization probes. Oligonucleotides 20 
or 25 bases long have been successfully utilized. To 
avoid problems with base composition affecting the 
Tm of the hybrids, the washing conditions of Wood et 
al. [80] were used. This eliminates the annealing 
strength differences of GC compared with AT base 
pairs, and means that all oligonucleotide probes 
may be washed at the same stringency. Finally, 
database searches conducted with STSs determined 
from the termini of cosmid clone inserts (see below) 
have also identified cloned genes which are therefore 
linked to the cosmid map [81]. 


28.3.2.5 Sequence-tagged sites 

Sequence-tagged sites (STSs) [82] are being deter- 
mined at intervals along the physical map, for two 
reasons. Firstly, they will make the map inde- 
pendent of the library with which the map is being 
constructed. Any investigator with access to a PCR 
machine and an oligonucleotide synthesis facility 
will be able to generate a locus-specific probe, by 
amplification from genomic DNA using primers 
deduced from the STS, with which to screen any 
genomic library available. Secondly, this part of the 
project has proved to be an efficient means by which 
the recombination map can be aligned with the 
physical map (and therefore also the cytogenetic 
map). STSs are determined by sequencing the 
termini of the cloned DNA segments within certain 
cosmid clones. These cosmid clones are selected on 
the basis of their location within contigs, and from 
the unattached cosmids with unambiguous cytolo- 
gical locations. A further use of the STSs is in the P1- 
based genome map of Rubin and colleagues (see 
Section 28.3.3). All the STS sequences generated in 
this project are deposited in the EMBL sequence 
database, and listed in FlyBase (Section 28.3.5). 
Searches for homology between STSs and sequence 
databases reveal that a number of STSs correspond 
to known Drosophila genes, and others to Drosophila 
homologues of known genes from other organisms. 
For example, within 568 STSs determined from 
mapped cosmids derived from the X chromosome, 
33 corresponded to known Drosophila genes, and 
nine represented homologues of genes cloned from 
other species. 


28.3.2.6 Determination of sequence tags from single P element insertions 

A more recent addition to the European Drosophila 
Genome Mapping project concerns the characteriz- 
ation of a collection of P element insertions 
consisting of ~ 3000 Drosophila lines, each containing 
a single third chromosomal P element insertion 
associated with lethal or subvital mutations. These P 
element insertion lines are being analysed in order 
to provide further detail to the cosmid physical map. 
The P element constructs selected for the muta- 
genesis have a number of important features. First, 
the elements are not autonomous; they cannot 
transpose in the absence of transposase gene 
function, and Drosophila strains containing such 
elements are therefore stable. Second, they have 
either a lacZ or Gal4 reporter gene acting as an 
enhancer trap. Third, they contain Escherichia coli 
plasmid sequences (an origin of replication, and an 
antibiotic resistance gene) enabling integrated 
elements to be cloned, together with flanking DNA, 
by plasmid rescue. 

The initial round of analysis involves mapping the 
insertions by testing them against chromosome 
deficiencies. Each group of lines is 
then further analysed by complementation testing. 
In addition to identifying duplicate loci, this step 
also enables the recognition of background lethals 
that may have been in the mutagenized fly strain 
prior to mutagenesis, and which will not be asso- 
ciated with P element insertions. The cytological 
position of each insertion is subsequently precisely 
determined by in situ hybridization to polytene 
chromosomes. 

Because the integrated P element vector still 
contains a functional plasmid origin of replication 
and encodes ampicillin resistance, genomic DNA 
sequences flanking the site of insertion may be 
cloned by plasmid rescue. The DNA sequence 
immediately flanking the site of insertion will be 
determined as an STS. Moreover, because each 
insertion is associated with a lethal phenotype, each 
of the STSs determined is tightly linked to a vital 
gene. Genomic sequences isolated in such plasmid 
rescues will be related to the cosmid-based physical 
map by hybridization or PCR, providing a link 
between genetic and physical maps, and would 
be expected to identify a number of novel genetic 
loci. 


28.3.2.7 Current status of the map 

Analysis of the X and second chromosomes is 
essentially complete. Some cytological divisions 
appear to be under-represented in this map [81], for 
reasons that are poorly understood. Estimates of 
coverage, calculated per cytological division, 
range from 28% to over 100% and average 64%. They 
are based on the DNA content per band calculated 
by Sorsa [24] and on an estimate of the size of the 
cosmid contigs, taking into account those 
cosmids mapped to the division that are not 
members of contigs. 
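
The arithmetic behind a per-division coverage estimate can be sketched as follows. This is a minimal illustration, not the consortium's procedure: the function name, the assumed 35 kb contribution of each unattached cosmid, and all the numbers in the example are hypothetical.

```python
def division_coverage(contig_kb, unattached_cosmids, division_kb,
                      cosmid_insert_kb=35.0):
    """Percentage of a cytological division represented in cosmids.

    contig_kb          -- estimated sizes (kb) of the contigs in the division
    unattached_cosmids -- number of mapped cosmids not belonging to any contig
    division_kb        -- DNA content (kb) of the division, e.g. summed per band
    cosmid_insert_kb   -- assumed contribution of one unattached cosmid (hypothetical)
    """
    covered = sum(contig_kb) + unattached_cosmids * cosmid_insert_kb
    return 100.0 * covered / division_kb


# Hypothetical division: two contigs, two unattached cosmids, 500 kb of DNA.
print(division_coverage([120.0, 80.0], 2, 500.0))  # prints: 54.0
```

Because both the contig sizes and the DNA content per band are themselves estimates, and unattached cosmids may overlap contigs, a calculation of this kind can return values above 100%.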


28.3.2.8 Availability of cosmid clones 

All the cosmids mapped and analysed are freely 
available to the research community. The complete 
list of clones, with cytogenetic location and other 
information can be found in FlyBase. STS sequences 
are deposited in the EMBL sequence database. 


28.3.3 The Drosophila Genome Center 


A major initiative towards a physical map closely 
tied to the wealth of available genetic data is being 
undertaken by the Drosophila Genome Center (G.M. 
Rubin, personal communication). The principal 
investigator is G.M. Rubin (Berkeley), and the 
Center actually comprises a number of laboratories: 
those of C. Martin and M. Palazzolo (Lawrence 
Livermore, Berkeley), D.L. Hartl (Harvard), and A. 
Spradling (Carnegie Institution, Baltimore). Signifi- 
cantly, this project is working closely with the 
automation laboratories at Lawrence Livermore to 
develop software and hardware for the project. The 
Center can be accessed through FlyBase. 


28.3.3.1 Construction of a complete P1 physical map 

The Drosophila Genome Center is constructing a map 
using the P1 cloning technology developed by 
Sternberg [83]. The library used was constructed in 
D.L. Hartl’s laboratory [71] and contains five 
genome equivalents. This library has been the 
subject of an in situ-based mapping strategy very 
similar to that carried out for the YAC library by 
Hartl’s laboratory [71,72]. The Drosophila Genome 
Center is engaged in assembling these clones into 
contigs with molecularly defined overlaps, as 
determined by STS content mapping. Figure 28.4 
shows a flow chart depicting the Drosophila Genome 
Center’s mapping strategy. 


In situ hybridization mapping A strategy analogous 
to that used in the YAC map project described in 
Section 28.3.1 has been used to create a ‘framework’ 
map of P1 clones aligned with the polytene 
chromosome cytogenetic map [72]. 

The P1 library consists of 9216 clones, of which 
~40% were made using the vector pNS582- 
tet14Ad10, and the remainder with pAd10sacBII. 
The genomic DNA was derived from nuclei isolated 
from adults of an isogenic y; cn bw sp stock. The 
genomic DNA was partially digested with Sau3A 
before ligation into the BamHI sites of the two 
cloning vectors. The two sets of clones have a very 
similar distribution of insert sizes, averaging 
slightly over 80 kb. Hartl et al. [72] describe the in situ 
analysis of 3104 clones. Of these clones, 388 
hybridized to the chromocentre or to many euchro- 
matic sites, and 191 clones were deliberate dupli- 
cates to assess the accuracy of the interpretation of in 
situ results. The presence of transposable elements 
was inferred for about 10% of the remaining 2461 
clones, which gave dispersed multiple sites of 
hybridization. As with the cosmid in situ hybri- 
dizations described in Section 28.3.2, in general 
the primary signal is easily determined by its 
intensity. The mapped clones with unique or 
primary hybridization sites total 200 Mb of insert 
DNA (assuming an average insert size of 80 kb), 
equivalent to 1.8 copies of the euchromatic fraction 
of the haploid genome. The map represents an 
estimated 85% coverage of the genome. Hartl et al. 
[72] present a diagrammatic representation of the P1 
clone map relative to the polytene map. 

Fig. 28.4 Flow chart of the Drosophila Genome Center 
genome mapping project. 

These in situ hybridization studies revealed 64 
clones with two sites of hybridization. Although 
these clones were not investigated further, it is 
probable that they represent chimaeric clones. 
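
The relation between the redundancy of the mapped clones (1.8 genome copies) and the estimated coverage (85%) can be checked against the standard random-clone model, in which the fraction of the genome expected to be represented at redundancy c is 1 - exp(-c). The sketch below assumes randomly distributed clones; whether Hartl et al. [72] derived their estimate in this way is not stated here, so the comparison is illustrative only.

```python
import math


def expected_coverage(redundancy: float) -> float:
    """Expected fraction of a genome covered by randomly placed clones."""
    return 1.0 - math.exp(-redundancy)


total_insert_mb = 200.0   # insert DNA in uniquely mapped clones (from the text)
copies = 1.8              # genome equivalents quoted in the text

# Euchromatic genome size implied by the two figures above.
implied_euchromatin_mb = total_insert_mb / copies

print(round(implied_euchromatin_mb))           # prints: 111
print(round(expected_coverage(copies), 3))     # prints: 0.835
```

The model value of about 83.5% is close to the 85% quoted, which suggests that the mapped clones are distributed roughly at random along the euchromatin.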


STS content mapping STS content mapping is carried 
out using a two-stage PCR analysis of the entire P1 
clone library. Primers are designed from the STS 
sequence itself. Positive signals are detected as 
bands on agarose gel electrophoresis. To streamline 
the PCR analysis, the P1 library, picked into 96 96- 
well microtitre dishes, has been divided into a 
collection of pools: there are 96 plate pools, each of 
which contains all 96 clones from a plate. In the first 
stage, PCR amplifications are performed on all 96 
pools, by which the plates containing clones 
spanning a particular STS can be rapidly identified. 
In the second stage, row and column pools 
(containing 12 and 8 clones, respectively) from 
microtitre plates identified in the first round are 
screened to identify positive clones unambiguously. 
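
The two-stage pooling scheme can be sketched as a small simulation. The code below is an illustration under stated assumptions: the clone addresses, the function name `screen` and the example positives are hypothetical, and a pool is simply treated as positive whenever it contains at least one clone carrying the STS.

```python
from itertools import product

ROWS = "ABCDEFGH"        # 8 rows per 96-well plate
COLS = range(1, 13)      # 12 columns per 96-well plate


def screen(positives, n_plates=96):
    """Return (candidate clones, number of PCRs) for one STS.

    positives -- set of (plate, row, col) addresses that contain the STS
    """
    def pcr(pool):
        # A pooled PCR is positive if any clone in the pool carries the STS.
        return any(clone in positives for clone in pool)

    # Stage 1: one reaction per plate pool (all 96 clones of a plate).
    hit_plates = [p for p in range(1, n_plates + 1)
                  if pcr((p, r, c) for r, c in product(ROWS, COLS))]
    reactions = n_plates

    # Stage 2: row pools (12 clones) and column pools (8 clones) of each hit plate.
    candidates = []
    for p in hit_plates:
        hit_rows = [r for r in ROWS if pcr((p, r, c) for c in COLS)]
        hit_cols = [c for c in COLS if pcr((p, r, c) for r in ROWS)]
        reactions += len(ROWS) + len(COLS)
        candidates += [(p, r, c) for r in hit_rows for c in hit_cols]
    return candidates, reactions


# Hypothetical STS present in two clones on different plates.
hits, n_pcr = screen({(3, "B", 7), (40, "H", 12)})
print(hits, n_pcr)   # prints: [(3, 'B', 7), (40, 'H', 12)] 136
```

In this example 136 reactions replace the 9216 that would be needed to test every clone individually. If two positive clones lie on the same plate in different rows and columns, the row and column pools alone leave four candidate wells, and a further test of those wells would be needed.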

Initially, STS content mapping was used to 
assemble contigs for the genomic regions containing 
the bithorax complex (300kb), the Antennapedia 
complex (350 kb), and a 2000 kb region around the 
alcohol dehydrogenase locus (Adh). The latter region 
is a section of the genome that has been extensively 
characterized genetically and cytogenetically: many 
mutants and chromosome rearrangements are 
available. 

STSs are determined by sequencing the termini of 
cloned inserts of the P1 library. Additionally, STSs 
are determined from the insertion sites of P elements 
(see Section 28.3.2.6) and from Drosophila gene 
sequences in the sequence databases. 


28.3.3.2 The use of P elements as sequence-tagged sites 

As described in Section 28.2.3.2, P element trans- 
formation vectors have been of great importance in 
modern Drosophila genetics, with some highly 
sophisticated systems in use. A.C. Spradling and his 
laboratory have accumulated a large number of 
Drosophila stocks, each bearing a single P element 
insertion and yielding a mutant phenotype as a 
consequence of that insertion. Each of these in- 
sertion stocks therefore is a marker for a gene. The P 
element used contains plasmid DNA positioned 
between the P element termini, and which is 
therefore inserted into the genome along with the 
element. The plasmid DNA enables the element to 
be ‘rescued’ from genomic DNA by restriction 
digestion followed by recircularization. These res- 
cued plasmids also contain fragments of genomic 
DNA that originally flanked the P element. P 
elements are being retrieved from these stocks by 
plasmid rescue, and their sites of insertion 
sequenced to yield STSs. 

Each P element line is analysed by in situ 
hybridization to determine the location of the 
insertion on the polytene chromosome map. Those 
insertions with similar map positions are checked by 
genetic complementation tests for allelism. This 
analysis also reveals those lines with more than one 
inserted element (about 7%), which are not suitable 
for further analysis. 

In order to facilitate the dissemination of these 
materials, copies of the collection of stocks have 
been distributed to five laboratories, and to the 
Bloomington Stock Center. Stocks can be ordered by 
e-mail through FlyBase. Several stock centres may 
be accessed in this way. 


28.3.3.3 cDNA project 

The original goals of this project were to accumulate 
2000-3000 different cDNA sequences (expressed 
sequence tags, ESTs) for use as STSs. Analysis by 
database searches would be likely to identify many 
genes, and spot homologies. However, the emphasis 
has now shifted towards the characterization of all 
transcribed sequences in a defined region of the 
Drosophila genome, the 2 Mb region around the gene 
Adh. Accordingly, this project is now aimed at the 
large-scale sequencing of this genomic region, by 
using and developing novel nonrandom sequencing 
strategies, and the characterization of cDNAs 
encoded within the region. 


28.3.3.4 Large-scale sequencing 

Shotgun sequencing presents a number of problems 
when applied to large-scale DNA sequencing (see 
Chapter 20). In addition to the redundancy of 
repeatedly sequencing the same stretches of DNA, 
many projects end with the synthesis of many 
oligonucleotide primers with which gaps between 
sequence may be bridged. The Drosophila Genome 
Center has developed a directed sequencing stra- 
tegy that overcomes many of the drawbacks of con- 
ventional shotgun sequencing. The overall strategy 
is based upon the transposon-insertion method 
of generating templates, developed by Palazzolo 
and colleagues [84]. A significant emphasis on the 
use of specifically designed software and hardware 
is being made by the Drosophila Genome Center. 


The assembly of ordered arrays of subclones The 
strategy begins with the establishment of conti- 
guous arrays of P1 clones. These clones have an 
average insert size of about 80kb, which must be 
subdivided before sequencing. An ordered set of 
~960 subclones containing inserts of about 3kb is 
made from each P1 clone. Clones are ordered by 
PCR amplification based upon limited sequence 
analysis of subclones. The subclone library is 
analysed by a pooling system similar to that used in 
the STS content mapping of the P1 library. However, 
in this case one of the primers is complementary to 
the plasmid vector flanking the site of insertion, 
while the second primer is complementary to 
sequence determined from one of the subclones. 
Thus each clone will yield a different sized DNA 
fragment upon PCR amplification, with the frag- 
ment size dependent upon the degree of overlap 
with the starting clone. In this way 30 DNA pools 
need be screened by PCR (10 plate pools, 8 row 
pools, and 12 column pools) to identify suitable 
clones. By determining sequence from the mini- 
mally overlapping clone, a further round of analysis 
can be undertaken. The end result of this stage of the 
analysis is an ordered array of small subclones 
corresponding to a contig of P1 clones. 
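
The way in which 30 pools can identify individual subclones may be sketched as follows. Because each overlapping subclone gives a product of a different size, a band of a given size seen in exactly one plate pool, one row pool and one column pool defines a single clone address. The data structures and example sizes are hypothetical, and the sketch assumes that all product sizes are distinct and that product length increases with the extent of overlap, so that the shortest product marks the minimally overlapping clone.

```python
def pooled_bands(clone_products):
    """Band sizes expected in each plate, row and column pool.

    clone_products -- {(plate, row, col): product size in bp} for positive clones
    """
    plate_pools, row_pools, col_pools = {}, {}, {}
    for (plate, row, col), size in clone_products.items():
        plate_pools.setdefault(plate, set()).add(size)
        row_pools.setdefault(row, set()).add(size)
        col_pools.setdefault(col, set()).add(size)
    return plate_pools, row_pools, col_pools


def decode(plate_pools, row_pools, col_pools):
    """Map each band size to the unique (plate, row, col) in which it appears."""
    addresses = {}
    for size in set().union(*plate_pools.values()):
        plates = [k for k, bands in plate_pools.items() if size in bands]
        rows = [k for k, bands in row_pools.items() if size in bands]
        cols = [k for k, bands in col_pools.items() if size in bands]
        if len(plates) == len(rows) == len(cols) == 1:
            addresses[size] = (plates[0], rows[0], cols[0])
    return addresses


# Hypothetical result: three subclones overlap the starting clone.
products = {(0, "A", 1): 2400, (4, "C", 7): 900, (9, "H", 12): 1500}
addresses = decode(*pooled_bands(products))

# Under the stated assumption, the shortest product is the minimal overlap.
print(addresses[min(addresses)])   # prints: (4, 'C', 7)
```

With 10 plates of 96 wells, the 10 plate pools, 8 row pools and 12 column pools give the 30 reactions mentioned above.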


Transposon-facilitated DNA sequencing Strathman et 
al. [84] describe the use of γδ transposon mobilization 
to provide priming sites for DNA sequencing. 
In brief, the E. coli host strain carrying an F factor 
with a γδ element is transformed with an ampicillin-resistant 
genomic subclone and allowed to conjugate 
with a kanamycin-resistant strain. By plating 
conjugants on medium containing both kanamycin 
and ampicillin, only cells receiving plasmids 
transferred during conjugation as γδ transposition 
cointegrates between the subclone plasmid and the 
F plasmid are recovered. Following resolution of 
the cointegrate, which yields plasmids with γδ 
insertions, the colonies are rapidly screened by PCR 
using one vector primer and one γδ-specific primer 
to establish the site of integration of the γδ element. 
A minimally overlapping set of clones is selected for 
sequencing. 
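
The selection of a minimally overlapping set of insertions can be sketched as a greedy interval-covering problem. The code below is an illustration only, not the Center's software: it assumes that each insertion yields one read in each direction from the transposon ends, that all reads have the same usable length, and that insertion positions (in bp from one end of the subclone insert) are known from the PCR screen. All names and numbers are hypothetical.

```python
def minimal_tiling(insertions, insert_len, read_len):
    """Choose the fewest insertion sites whose reads cover the whole insert.

    An insertion at position p is assumed to cover [p - read_len, p + read_len].
    Raises ValueError if the available insertions leave a gap.
    """
    sites = sorted(set(insertions))
    chosen, covered_to, i = [], 0, 0
    while covered_to < insert_len:
        best = None
        # Among sites whose leftward read still reaches the covered region,
        # the rightmost one extends coverage furthest.
        while i < len(sites) and sites[i] - read_len <= covered_to:
            best = sites[i]
            i += 1
        if best is None:
            raise ValueError(f"no insertion bridges the gap after {covered_to} bp")
        chosen.append(best)
        covered_to = best + read_len
    return chosen


# Hypothetical 3 kb subclone with ten mapped insertions and 400 bp reads.
mapped = [100, 400, 700, 1000, 1300, 1600, 1900, 2200, 2500, 2800]
print(minimal_tiling(mapped, insert_len=3000, read_len=400))
# prints: [400, 1000, 1600, 2200, 2800]
```

Five of the ten insertions suffice in this example; in practice some overlap between adjacent reads would be retained to allow the sequences to be joined.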

Sequencing is carried out with chain termination 
protocols, using sequencing primers complementary 
to the γδ transposon. Contig assembly is performed 
using both commercially available software, 
and software developed by the Lawrence Berkeley 
Laboratory (LBL) Human Genome Center comput- 
ing group (see Appendix V.1 for contact address). 


28.3.3.5 Availability of clones 
The computer database FlyBase should be consulted 
for details of mapped P1 clones. Gridded copies of 
the P1 library have been distributed to a number of 
laboratories around the world, and clones can be 
obtained from these sources. This strategy was 
adopted in order that clones might be obtained by 
researchers quickly, efficiently and at low cost. Clone 
requests should be directed to the nearest laboratory 
holding a copy of the library. FlyBase contains 
details of these laboratories. 


28.3.4 Sequencing the Drosophila genome 


At present, sequencing of the D. melanogaster 
genome is being undertaken by two groups. 
The US Drosophila Genome Center is using the 
technology outlined in Section 28.3.3.4 to carry 
out large-scale sequencing, while the European 
Union is funding a consortium of laboratories to 
carry out a pilot project in which the sequence of the 
terminal three divisions of the X chromosome will 
be determined. In contrast to the US DGC strategy, 
the European project is using a shotgun cloning 
approach, in which individual mapped cosmids are 
fragmented by sonication, and the fragments sub- 
cloned to yield templates for sequencing. 


28.3.5 FlyBase 


FlyBase is a database containing information on the 
genetics and biology of Drosophila, being built by a 
collaboration between researchers funded by the 
National Institutes of Health in the USA, and the 
Medical Research Council in the UK. The labora- 
tories involved in this collaboration are those of W. 
Gelbart (Harvard), M. Ashburner (Cambridge, UK), 
T. Kaufman and K. Matthews (Indiana University, 
Bloomington), and J. Merriam (UCLA). Access to 
FlyBase is most easily obtained by using a Web 
browser, such as Netscape. FlyBase is found at 
http://morgan.harvard.edu:80/ where details of 
FlyBase mirrors in the UK, Australia and Japan may 
be found. 

FlyBase contains a wealth of genetic and molecular 
information concerning Drosophila. The text of the 
Lindsley and Zimm ‘Red Book’ [7] is included (only 
in the Indiana copy, owing to copyright reasons), as 
are lists of chromosome aberrations (sorted by class 
and cytological breakpoints), molecular clones, the 
genetic map, and stock lists of the international 
Drosophila stock centres, amongst other material. 
These files can be interactively searched. 


28.3.6 Genome mapping in other Diptera 


The application of genome mapping to the genetic 
study of other insect groups can proceed both by direct 
use of the D. melanogaster genome map and by 
transfer of techniques developed for mapping the D. 
melanogaster genome. As well as the obvious benefits 
such maps yield for the molecular genetic analysis of 
these species, they will have an impact on a variety 
of areas, such as evolutionary biology. For example, 
phylogenetic relationships of many Drosophila 
species have been deduced from the analysis of the 
banding sequence of their polytene chromosomes. 
Ashburner [85] has made a strong argument for the 
importance of genetic mapping in a variety of insect 
species, and in many cases, molecular physical 
genome maps will be important. I will discuss the 
application of large-scale genome analysis to some 
Diptera other than D. melanogaster. 


28.3.6.1 Drosophila virilis 

At 313Mb, the genome of Drosophila virilis is 
approximately double the size of that of D. melano- 
gaster. There are six chromosome pairs in the diploid 
complement: acrocentric sex chromosomes, four 
pairs of acrocentric autosomes, and a pair of tiny 
autosomes. In general terms, the distribution of 
genetic loci is in agreement with the conservation of 
Muller’s elements (see Section 28.2.2.3), although 
the gene order within each element is scrambled 
relative to D. melanogaster. The X chromosome of D. 
melanogaster corresponds to the X chromosome of D. 
virilis, 2L to chromosome 4, 2R to chromosome 5, 3L 
to chromosome 3, 3R to chromosome 2, and 4 
corresponds to D. virilis chromosome 6 [86]. 

Drosophila virilis is a member of the subgenus 
Drosophila, while D. melanogaster is a member of the 
subgenus Sophophora. The polytene chromosome 
banding pattern of D. virilis is not similar to that of 
D. melanogaster, and the degree of sequence diver- 
gence between the two species is sufficient that D. 
melanogaster clones cannot be directly used in 
hybridization studies with D. virilis DNA. A mole- 
cular genome map would be of use to those studying 
genome evolution in the genus Drosophila, and to 
those engaged in molecular and genetic research in 
D. virilis. The D. melanogaster P1 map described in 
Section 28.3.3.1 has been utilized in two ways in the 
D. virilis work. Firstly, the transfer of molecular 
biological technology has permitted the develop- 
ment of a useful P1 library and mapping strategies, 
and secondly, protocols have been developed that 
allow the polytene locations of D. melanogaster 
homologues to be determined. 

A D. virilis P1 genomic library has been 
constructed by Lozovskaya et al. [87]. This library 
consists of more than 10000 clones of average insert 
size 65.8 kb. These clones are being mapped by in 
situ hybridization to polytene chromosomes, as with 
the YAC- and P1-based maps. Because it has proved 
impossible to map D. melanogaster P1 clones to D. 
virilis polytene chromosomes reliably by in situ 
hybridization, owing to sequence divergence out- 
side conserved regions, a scheme for identifying 
homologous regions was devised. In brief, DNA 
fragments were PCR amplified from D. melanogaster 
genomic DNA, and used to screen the D. virilis P1 
library. These P1 clones were then used to map the 
D. virilis homologue by in situ hybridization. 


28.3.6.2 Drosophila pseudoobscura 

Extensive research has been carried out on the 
evolution of the obscura species group, and in 
particular D. pseudoobscura. Consequently, genome 
mapping of D. pseudoobscura has been initiated [72]. 
In contrast to the situation described above for in 
situ hybridization studies using D. melanogaster 
P1 clones to probe D. virilis polytene chromosomes, 
D. pseudoobscura is sufficiently closely related to 
D. melanogaster for the corresponding experiments 
to work with a reasonable degree of success. D. 
pseudoobscura belongs to the same subgenus as D. 
melanogaster. However, the polytene chromosome 
banding patterns are not comparable, although 
experiments confirming the existence of Muller’s 
elements by in situ hybridization have been carried 
out. 


28.3.6.3 Anopheles gambiae 

The mosquito Anopheles gambiae is the major 
malarial vector in sub-Saharan Africa. As a conse- 
quence it has attracted a great deal of research 
interest. A low-resolution genome map based on 
microdissection of polytene chromosomes has been 
constructed [88]. This map consists of pools of DNA 
amplified by PCR following microdissection [65,78], 
and is intended to aid the molecular genetic 
characterization of this medically important insect. 

One of the points of interest in the genomic 
analysis is the presence of six sibling species best 
distinguished by chromosomal inversions that can 
be visualized in the polytene chromosomes of larval 
salivary glands or the ovarian nurse cells. Genomic 
analysis of the species complex will be vital if 
attempts are to be made to genetically modify 
natural mosquito populations. 

A detailed recombination map is being assembled 
for A. gambiae [89], using microsatellites as markers. 
Some of these markers were derived from the low- 
resolution microdissection genome map. All these 
microsatellite markers are effectively STSs tied to the 
recombination map, and by in situ hybridization to 
the polytene chromosome map. 


28.4 Conclusions and prospects 


At present, Drosophila researchers have access to 
three genome maps assembled by direct correlation 
of clones with the polytene chromosomes, using 
clone libraries constructed in YAC and P1 vectors. 
These maps are virtually complete; estimates of their 
coverage of the genome are currently 90% [66] and 
76-63% [67] for the two YAC maps, and 85% for the 
P1 map [72]. The European Consortium cosmid- 
based map has also been essentially completed to an 
acceptable degree of coverage for many regions of 
the genome (the X, and many of the chromosome 
map divisions of the autosomes). In the case of the X 
chromosome, the overall coverage in cosmids is 
estimated at 65%, though this figure varies between 
divisions [81]. Furthermore, in the near future, the 
P1-based map will be refined by molecular analysis 
to yield molecularly demonstrable clone overlaps. 
Drosophila researchers already have excellent access 
to clones derived from specific regions of the 
genome. 

The Drosophila genome is of similar size to a 
typical mammalian chromosome, so those mapping 
projects that are not tied into a specifically Droso- 
phila-based approach should function well as models 
for mapping strategies for larger genomes. In parti- 
cular, the approach taken by the Drosophila Genome 
Center is a good model. In these cases, the added 
benefit of the high-resolution cytological maps 
afforded by polytene chromosomes enables frequent 
checks on map quality to be made. 

Particular issues of interest in physical mapping 
that may be approached in Drosophila include ap- 
proaches to the long-range analysis of centromeres 
and heterochromatic regions of chromosomes. In 
particular, a powerful approach is to use deletion 
derivatives of chromosomes to narrow down 
regions absolutely required for accurate chromo- 
some segregation. 

Drosophila remains the model organism of choice 
in many areas of research, and the physical maps of 
the D. melanogaster genome now available represent 
a major asset in the future exploitation of this most 
important model organism. 
29.1 Introduction 


Caenorhabditis elegans is a small, free-living soil 
nematode found in many parts of the world. It is a 
simple organism which is easily maintained and 
studied in the laboratory. The adult hermaphrodite 
has only 959 somatic nuclei, while the adult male has 
a total of 1031. The haploid genome is ~100 Mb, 
which is about eight times larger than that of the yeast 
Saccharomyces, two thirds the size of that of the fruit fly 
Drosophila and 30 times smaller than the human 
genome. 

About 80% of the C. elegans genome is composed 
of single-copy sequences, with the remainder being 
moderately repetitive sequences which occur in 
two to many copies per genome. Genes of C. elegans 
have been mapped into six linkage groups which 
correspond to the six haploid chromosomes. More 
than 1000 C. elegans genes, distributed over the six 
linkage groups, have been identified; study of these 
genes is leading to important new insights in 
neurobiology and developmental biology. 

In support of this enterprise, the C. elegans genome 
project was begun in the early 1980s. Since its incep- 
tion the genome project has pioneered approaches 
to physical mapping and genome sequencing. 
Today, the physical map of the genome is among the 
largest and most complete yet constructed for a 
multicellular organism. It consists of 17000 cosmids 
and 2500 YACs, which have been positioned relative 
to each other by gel fingerprinting and cross- 
hybridization. More genomic sequence and more 
sequenced genes are now available from the worm 
than from any other multicellular organism, and we 
are on target for completion of the full genomic 
sequence by the end of 1998. The utility of these 
resources can be judged by the many C. elegans 
laboratories now using the map and sequence to 
study mutationally defined genes, and by the use of 
sequence homologies by many more laboratories. 
The genome sequence is critical to gaining a 
thorough understanding of this important model 
organism and will aid in studies of human disease. 

In this chapter we describe the underlying 
philosophy and the general approaches that we feel 
have been important for the success of the project. 
These points are applicable not only to other small 
genome projects, but also to the much larger and 
more challenging Human Genome Project. 


29.2 The genome map 


The physical map of the C. elegans genome consists 
largely of overlapping cosmid and YAC clones [1-3]. 
Both components are essential: the YACs, by virtue 
of their large inserts and propagation in yeast, 
provide long-range continuity and can hold DNA 
that is unclonable in bacterial cosmid clones; the 
cosmids provide high resolution locally and a more 
convenient substrate for biochemistry. 

The principal techniques for physical mapping 
have been: 

• restriction enzyme-based fingerprinting for 
construction of cosmid contigs (Fig. 29.1); 

• hybridization of individual YAC and cosmid 
clones to gridded arrays of cosmids and YACs, 
respectively; 

• sequence-tagged site (STS) assays for direct 
detection of YAC/YAC overlaps; 

• hybridization of YAC and cosmid clones to C. 
elegans chromosomes for long-range ordering. 

The topological constraints imposed by the 
ordered cosmids were important for the inter- 
pretation of YAC/cosmid hybridizations, in that 
they helped to distinguish genuine matches from 
spurious matches due to repetitive sequences. 

This physical array of cloned DNAs is made into a 
genome map by the wealth of genetic markers that 
have been attached to specific clones. This was 
facilitated by the early and unrestricted distribution 
of the clone resources and by the readiness of the 
community of C. elegans researchers to share in- 
formation prior to publication. In fact, the genetic 
matches were also critical mapping tools, in that 
they, along with in situ hybridization, provide the 
longest-range linkage. 

Unlike the physical mapping, which has been 
carried out mainly by the two laboratories above, 
the genetic linkage has been achieved by the 
cooperative effort of the entire C. elegans community. 
The communal approach is important in two ways. 
From the point of view of the map, it ensures that the 
specialized knowledge of all individuals and groups 
is brought to bear on the project. From the point of 
view of effort and funding, it means that everyone is 
involved and allows the central resources to be as 
lean and focused as possible. 

The map as a whole gains from being a multilevel 
construct: no single technique is sufficient by itself to 
provide full linkage, and strength arises from partial 
redundancy between the levels. This is important, 
because all mapping information is to some degree 
stochastic. 

The gel fingerprinting method is readily scalable, 
and is being applied increasingly to the human 
genome. Since the original nematode work, the 
procedure has been automated, fluorescent labels 
have been introduced instead of radioactive labels, 
and new generations of assembly software are 
appearing. In contrast to the worm, long-range 
order in the human genome is being achieved at an 
earlier stage, by STS analysis of YACs and radiation 
hybrids, and by in situ hybridization. However, just 
as in the worm, bacterial clones, whether cosmids, 
P1 clones (P1s), P1 artificial chromosomes (PACs), 
or bacterial artificial chromosomes (BACs) (see 
Chapter 15) will provide the preferred substrates for 
biochemistry. 

Fig. 29.1 Autoradiogram showing fingerprints of 
nematode cosmids. The lanes with closely spaced bands 
contain markers, and the rest contain samples. Many 
different fingerprinting methods can be, and have been, 
used for genome projects. The only requirement is that a 
pattern is generated from each clone, such that the partial 
identity of patterns from a pair of overlapping clones can 
be recognized. For the C. elegans project, a method based 
upon cutting with HindIII and Sau3AI was used; it was 
arranged that the HindIII sites, but not the Sau3AI sites, 
were labelled radioactively by end-filling and the 
resulting fragments were separated on a denaturing 
polyacrylamide gel of the type used for sequencing 
[1,19]. From [19] by permission of Oxford University 
Press. 
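
The requirement that the partial identity of two fingerprints be recognizable (Fig. 29.1) can be illustrated with a simple band-matching sketch. This is an assumption-laden illustration, not the scoring used by the C. elegans mapping software: band positions are taken as hypothetical mobility values, two bands are counted as the same if they differ by no more than a fixed tolerance, and the score is the fraction of the smaller fingerprint that is matched.

```python
def shared_bands(a, b, tol=1.0):
    """Count bands in fingerprint a matched one-to-one by bands in b within tol."""
    a, b = sorted(a), sorted(b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            matches += 1          # treat as the same band; consume both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


def overlap_score(a, b, tol=1.0):
    """Fraction of the smaller fingerprint shared with the other clone."""
    return shared_bands(a, b, tol) / min(len(a), len(b))


# Hypothetical band positions for two cosmids that overlap at one end.
cosmid_1 = [10, 25, 40, 55, 70]
cosmid_2 = [40.5, 55.2, 70.1, 90, 120]

print(shared_bands(cosmid_1, cosmid_2))    # prints: 3
print(overlap_score(cosmid_1, cosmid_2))   # prints: 0.6
```

A real comparison would also have to estimate how often such a number of matches arises by chance between unrelated clones, which depends on the number of bands per lane and the resolution of the gel.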


29.3 The genome sequence 


In sequencing the C. elegans genome, we have as far 
as possible adopted the same philosophy of 
collective endeavour. The task of the two central 
laboratories is restricted to collecting the data as 
efficiently as possible. They refrain strictly from 
exploiting the sequence data for their own research 
purposes before its release. As soon as the raw data 
has been assembled into contigs, it is available for 
screening by anyone by being placed on an 
anonymous ftp server. When each cosmid sequence 
is finished, it is analysed by computer to find pos- 
sible genes, database similarities and other features; 
the annotated sequence is then immediately sub- 
mitted to GenBank or the EMBL sequence database, 
as well as being placed in the C. elegans database 
ACEDB [4]. In this way, the expertise not only of the 
worm community but of the whole world is brought 
to bear at the earliest possible stage. 

Sequencing concentrated at first on the central 
regions of the autosomes and the whole of the X 
chromosome, totalling roughly 60% of the genome, 
because genetic and cDNA mapping data indicated 
that these areas contain the majority of the genes 
(perhaps more than 80% of the total; refs 5 and 6 and 
Y. Kohara, personal communication). The focus of 
sequencing has now shifted to the autosomal arms; 
although these may be less rich in genes, there are 
many important aspects beyond the protein coding 
elements that will only be addressed by the 
sequence of the whole genome. 

Starting in this way had the added benefit for 
the sequencing part of the project that we began by 
sequencing cosmids. Now, the ‘YAC bridges’ — the 
regions cloned in YACs but not in cosmids—are 
being dealt with. Starting from complete YACs is 
more difficult than from cosmids [7], because of the 
limited amounts and purity of material and their 
greater size. However, with improved analysis 
software, and with the whole yeast sequence avail- 
able to identify and remove contaminating host 
sequences, this method is practical. Smaller bridges 
have been successfully rescued by recombination 
from YACs in yeast, and others by long-range PCR. 
Some regions are susceptible to cloning in fosmid 
vectors (Stephanie Chissoe, personal communication) 
and many to cloning in λ-vectors on permissive 
hosts. 

Our principal sequencing strategy is an initial 
shotgun followed by directed finishing (described in 
detail in ref. 8, see also Chapter 20, Section 20.2). It is 
worth emphasizing here that shotgun sequences 
provide extremely detailed map information (in the 
form of the sequence itself), albeit only from a 
fraction of the subclone insert length. The proper 
assembly of these sequences allows the relative 
positions of the random subclones to be established, 
while at the same time producing the bulk of the 
final sequence. Like all mapping methods, it is 
vulnerable to repeated sequences. The density and 
accuracy of sequence information, however, 
compensate for the relatively short length of 
individual reads. Improved assembly programs (see 
below) are taking greater advantage of this 
information, and the sequence itself provides a 
powerful means of evaluating alternative maps. In uncertain 
areas, additional sequence—for example, from the 
opposite end of the insert—can provide additional 
map information. In addition, restriction enzyme 
digests of the parent clone provide a simple and 
direct means of testing overall map accuracy. 

Cost and accuracy are key considerations in 
evaluating effectiveness of any strategy. Current 
direct and indirect costs for the production of the 
final annotated sequence are below $0.45 a base, and 
total costs of all activities (including development 
and related research efforts) are below $0.70 per 
base. In general, the accuracy of the sequence 
appears to be better than 99.99%, on the basis of 
comparisons with previously sequenced genes, 
though these estimates are of limited reliability in 
the absence of a truly independent means of 
checking the sequence. 

Throughput and the ability to scale up the effort 
have also been important. As a result of increased 
automation/mechanization, improved software 
and better biochemistry, the combined production of 
nematode sequence exceeded 70 Mb in 1996. 
Collection of shotgun data for the remainder of the 
genome will be essentially completed in 1997, with 
finishing and gap-filling continuing in 1998. 

Our objective is to extract the information from 
the genome rather than to exhaustively sequence 
every last base. For example, even now we some- 
times describe long tandem repeats simply in terms 
of the consensus sequence and number of copies. 
The occurrence of such instances is more frequent as 
we move into repetitive regions, and so this method 
of reporting will increase. Conversely, we sequence 
all other regions as accurately as possible. 

The shotgun/directed approach can be applied 
equally well to the human genome, provided that 
the extensive repeat families are allowed for in the 
assembly algorithm. At first, we adapted R. Staden’s 
XGAP by screening the input so that Alu sequences 
were excluded from the initial assembly process. As 
for the nematode, we now begin with Phil Green's 
PHRAP (P. Green, personal communication 1994), 
which makes positive but selective use of repeat 
sequences in assembly, and then feed the results to 
XGAP or other editing programs. 


29.4 Status of the sequencing project 
and its applications 


The current status of the sequencing effort is 
summarized in Table 29.1. All of the sequence has been 
subjected to a series of programs to provide an initial 
interpretation of its features. Comparison with 
expressed sequence tags (ESTs) (ref. 6 and Y. Kohara, 
personal communication) allows the construction of 
a confirmed transcription map for a third of the 
predicted genes. 


Table 29.1 Current state of C. elegans 100-Mb genome sequencing project. 

Physical map    17500 cosmids 
                3500 YACs 
                Five autosomes: total of eight gaps (all in gene-poor regions) 
                X chromosome: single contig of ~18 Mb 

DNA sequence    65 Mb completed as of March 1997 
                1300 putative protein-coding genes (~1 per each 5 kb) 
                Approximately 45% have significant similarity to non-C. elegans genes 

As well as having access to finished sequences in the GenBank, EMBL and DDBJ databases, investigators can search 
these sequence data and also more preliminary unfinished sequences at the two genome centers. Currently, the 
searchable database contains 85 Mb of sequence data (65 Mb finished, 20 Mb unfinished) which is estimated to contain 
about 90% of the genes in C. elegans. Searches with a nucleotide or protein query use the BLAST programs and are 
submitted via a World Wide Web interface (see URL http://www.sanger.ac.uk/ and http://genome.wustl.edu/). The 
Web pages provide additional information about the genome and include help addresses. All data pertaining to the 
genome (including genetic and physical maps and the sequence) are combined in the database ACeDB. The latest release 
can be obtained by anonymous ftp from the following: USA, ncbi.nlm.nih.gov (130.14.20.1) in repository /acedb; UK, 
ftp.sanger.ac.uk (193.60.84.11) in pub/acedb; or France, lirmm.lirmm.fr (193.49.104.10) in genome/acedb. 

The C. elegans genome map is being strengthened 
by several other systematic studies. These projects 
are entirely independent, but their findings are 
united through the map and some of them draw on 
its resources. At the University of Leeds, UK, Ian 
Hope is collecting expression data using transgenic 
reporter constructs of the predicted genes (ref. 9 and 
I. Hope, personal communication 1996). Targeted 
gene disruption by transposon insertion was 
pioneered by Ronald Plasterk (Amsterdam, The 
Netherlands), and is now carried out in a number of 
laboratories: in this way, functionality for the 
predicted genes can be determined [10]. At the National 
Genetics Institute, Mishima, Japan, Yuji Kohara is 
continually adding to his set of sequence-tagged 
cDNAs and is determining their expression patterns 
by in situ hybridization (Y. Kohara, personal 
communication 1996). At Vancouver, Canada, David 
Baillie (Simon Fraser University) and Ann Rose 
(University of British Columbia) have generated 
transgenic strains incorporating sequenced cosmids; 
rescue of lethal and visible mutants by these 
strains will allow precise correlation of the genetic 
map with the sequence (refs 11-15 and A. Rose, 
personal communication 1996). 


Fig. 29.2 The feature window of ACEDB showing a 
region of chromosome III cloned in the cosmid F10F2. To 
the left of the kilobase scale (i.e. on the negative DNA 
strand) a gene with two large introns is shown (F10F2.2; 
similarity to phosphoribosylformylglycinamidine 
synthase; highlighted). To the right of the scale (on the 
positive strand) a family of five genes (F10F2.4, F10F2.7, 
F10F2.6, F10F2.8 and F10F5.5) is seen to lie within the 
introns. 


Fig. 29.3 The feature window of ACEDB showing a 
region of chromosome III cloned in the cosmid ZK637. 
The three genes ZK637.8, ZK637.9 and ZK637.10 lie 
head-to-tail on the positive strand. They have been shown to be 
transcribed as one operon [14]. Comparison with cDNA 
sequences (open boxes below the zoom-out button) 
shows that ZK637.8 has two alternative splicing patterns 
at position 23000. 


Direct analysis of the sequence yields many interesting features, with applications to gene function, evolution and 
medicine. A few examples are: 

Genes within the introns of other genes. In one case five such genes were found within a single gene (Fig. 29.2) 

Clusters of tRNAs, containing five members in one case and six in another 

Gene families, some where the family members are dispersed, and others where they are close together in tandem arrays. 
We can begin to look at the evolution of the individual members 

A relatively high incidence of inverted repeats within the introns of predicted genes. The functional significance of this, if 
any, is unknown 

Head-to-tail patterns of genes, which have been shown to be indicative of operons (Fig. 29.3) [14] 

Predictions of homologues to genes of medical importance, for example C50C3.7, the only known invertebrate gene 
similar to the human gene OCRL-1, implicated in Lowe’s syndrome 

The largest known C. elegans protein, predicted from a 45-kb gene in cosmid K07E12. It contains 13000 amino acid 
residues and contains multiple copies of the cell adhesion molecule motif [16,17] 

Many repeat families. Some are thought to be ‘dead’ transposons from an unknown and possibly extinct transposon type 

By far the largest use of the genome map and 
sequence is for the study of specific C. elegans genes. 
Virtually all C. elegans laboratories worldwide make 
use of the map and most of them request clones from 
it. Increasingly, laboratories working on other 
organisms are also using these resources. Genome 
and sequence data for C. elegans are held in the 
database ACEDB (accessible at various sites, see Chapter 
37). The database software (ACEDB) was designed 
as part of the C. elegans project to provide a database 
tool specifically for use by biologists. ACEDB is the 
creation of Richard Durbin (The Sanger Centre, 
Cambridge, UK) and Jean Thierry-Mieg (CNRS, 
Montpellier, France), who also maintain the database 
itself. 

Applications of the sequence are on several levels. 
At the most mundane level, the prior determination 
of the sequence by an efficient large-scale operation 
simply saves subsequent effort and resources. 
Importantly, the sequence provides ready-made 
tools, such as a restriction map and information for 
making primers, that facilitate experimental design. 

More creatively, the sequence provides new entry 
points to the genome. Homologues of known genes 
or parts of genes can be sought by computer. Not 
only is this style of searching faster than physical 
probing, but also it is more thorough. Weak simi- 
larities, beyond the detection limit of hybridization, 
can be picked up and evaluated. At present, 
investigators can search a total of about 85 Mb 
(65 Mb finished, 20 Mb unfinished), containing 
perhaps 90% of the genes. Searches will become 
virtually complete by the end of 1997, and will be 
greatly enhanced as the emerging families of genes 
and domains are subjected to cluster analysis and 
grouped by similarity. 

As the sequenced regions extend along the chro- 
mosomes, the large-scale structure of the genome 
starts to emerge. We are only just beginning to 
explore this level, but some of the early findings are 
illustrated in Figs 29.2 and 29.3 and Table 29.2. 
Figure 29.2 shows an annotated region of sequence 
of 30000 bp from chromosome III. Figure 29.3 shows 
10000 bp from chromosome III. Apart from patterns 
of genes, we begin to see the matrix of duplicated, 
inverted and transposed pieces of which the genome 
is composed. Somewhere, there are elements that 
mediate replication, recombination and segregation 
of the chromosomes, and others that control sex 
determination, dosage compensation and global 
gene expression. The sequence will provide the 
framework in which hypotheses about such mech- 
anisms can be developed and tested. 

Finally, the sequence forms a permanent archive 
whose value we can only begin to tap at the first 
pass. The analysis, modification, and above all 
comparison of sequences from different organisms 
will provide a major route to a full understanding of 
biology. 


30.1 Introduction 


The budding yeast Saccharomyces cerevisiae may be 
viewed as one of the most important fungi used in 
biotechnology. Used in making bread and alcoholic 
beverages, yeast has served mankind for several 
thousands of years. The use of yeast as an experi- 
mental system dates back to the mid-1930s [1] and it 
has since received increasing attention. The elegance 
of yeast genetics and the ease of manipulation of 
yeast, and finally the technical breakthrough of yeast 
transformation to enable it to be used in reverse 
genetics, have substantially contributed to the 
explosive growth in yeast molecular biology [2-4]. 
Its success is also due to the remarkable 
conservation of basic biological processes 
throughout the eukaryotes, which was not anticipated a 
dozen years ago. 


30.2 Yeast: an experimental system 
for molecular biology 


The unique position of S. cerevisiae as a model 
eukaryote owes much to its intrinsic advantages as 
an experimental system, in which cell architecture 
and fundamental cellular mechanisms can be 
successfully investigated. It is a unicellular organ- 
ism which, unlike more complex eukaryotes, can be 
grown on defined media, giving the investigator 
complete control over environmental parameters. 
Yeast is tractable to classical genetic techniques and 
functions in yeast have been studied in great detail 
by biochemical approaches [2-4]. In fact, a large 
variety of examples provides evidence that substan- 
tial cellular functions are highly conserved from 
yeast to mammals and that corresponding genes can 
often complement each other. No wonder then that 
yeast has again reached the forefront in experi- 
mental molecular biology by being the first eukary- 
otic organism for which the entire genome sequence 
is available [5,6]. The wealth of sequence infor- 
mation obtained in the yeast genome project has 
turned out to be extremely useful as a reference 
against which sequences of human, animal or plant 
genes may be compared. Moreover, the ease of 
genetic manipulation in yeast opens the possibility 
of functionally dissecting gene products from other 
eukaryotes in the yeast system. 


30.2.1 The yeast genome 


At 12.8 megabases (Mb), the yeast genome is about 
200 times smaller than the human genome but less 
than four times bigger than that of Escherichia coli. At 
the outset of the sequencing project, knowledge of 
some 1200 genes encoding either RNA or protein 
products had accumulated [7]. The complete 
genome sequence now defines some 6000 open 
reading frames (ORFs) which are likely to encode 
specific proteins in the yeast cell. A protein-coding 
gene is found every 2 kb in the yeast genome, with 
nearly 70% of the total sequence consisting of ORFs 
[6]. In addition to the protein-coding genes, the yeast 
genome contains some 120 ribosomal RNA genes in 
a large tandem array on chromosome XII, 48 genes 
encoding small nuclear RNAs (snRNAs) and 275 
tRNA genes (belonging to 43 families) which are 
scattered throughout the genome. Finally, the 
sequences of nonchromosomal elements, such as the 
6 kb of the 2-µm plasmid DNA, the killer plasmids 
present in some strains, and the yeast mitochondrial 
genome (~75 kb) have to be considered. 
None of the latter, however, has been included in the 
sequencing project; those sequences were largely 
determined in the 1980s. 

The compact nature of the S. cerevisiae genome is 
apparent when compared to more complex eukary- 
otic systems. For example, the genome of Caenorhab- 
ditis elegans contains a potential protein-coding gene 
only every 6 kb [8] and, in the human genome, gene 
density might be as low as one gene in 30 kb [9]. 
Current data (obtainable from S. Bowman and B. Barrell 
at http://www.sanger.ac.uk/yeast/pombe.html) 
indicate that even the genome of the fission yeast, 
Schizosaccharomyces pombe, has a lower gene density 
(one gene per 2.3 kb) than S. cerevisiae. The difference 
between the two yeast genomes appears to be due to 
the fact that in the fission yeast ~40% of the genes 
contain introns, whereas only 4% of the protein- 
coding genes in S. cerevisiae are interrupted by 
introns [6]. 
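
The gene densities quoted above follow from simple arithmetic on the genome size and the ORF count. The short Python sketch below reproduces that arithmetic from the figures given in the text (12.8 Mb, some 6000 ORFs, nearly 70% coding sequence). The function and variable names are illustrative only, and the mean ORF length is an estimate implied by these figures rather than a value stated in the text.

```python
# Figures quoted in the text for S. cerevisiae.
YEAST_GENOME_KB = 12_800      # 12.8 Mb
YEAST_ORFS = 6_000            # "some 6000" open reading frames
CODING_FRACTION = 0.70        # "nearly 70%" of the sequence consists of ORFs

# Gene spacing (kb per gene) quoted in the text for other genomes.
QUOTED_KB_PER_GENE = {"S. pombe": 2.3, "C. elegans": 6.0, "human": 30.0}


def kb_per_gene(genome_kb, n_genes):
    """Average spacing between genes, in kb per gene."""
    return genome_kb / n_genes


def mean_orf_length_kb(genome_kb, n_genes, coding_fraction):
    """Mean ORF length implied by the coding fraction (an estimate)."""
    return genome_kb * coding_fraction / n_genes


density = kb_per_gene(YEAST_GENOME_KB, YEAST_ORFS)
mean_orf = mean_orf_length_kb(YEAST_GENOME_KB, YEAST_ORFS, CODING_FRACTION)

print(f"S. cerevisiae: one gene per {density:.2f} kb")
print(f"implied mean ORF length: {mean_orf:.2f} kb")
for organism, spacing in QUOTED_KB_PER_GENE.items():
    # How many times sparser the genes are than in S. cerevisiae.
    print(f"{organism}: {spacing / density:.1f}x the yeast gene spacing")
```

With these inputs the spacing comes out at roughly one gene per 2.1 kb, in line with the figure of one gene every 2 kb given above.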


30.2.2 The chromosomes 


The genome of S. cerevisiae is divided up into 16 
chromosomes ranging in size from 250 to > 2500 kb. 
Choosing appropriate conditions, it is feasible to 
separate all 16 chromosomes by pulsed field gel 
electrophoresis (PFGE). This provides definition of 
‘electrophoretic karyotypes’ of strains by sizing 
chromosomes [10]. Laboratory strains possess 
different karyotypes, because of chromosome length 
polymorphisms and chromosomal rearrangements, 
but so do industrial strains. The gels can be utilized 
for Southern blotting followed by hybridization, or 
to isolate chromosome-specific DNA. 


30.2.3 Genetic mapping 


The first genetic map of S. cerevisiae was published 
by Lindegren in 1949 [11]; many revisions and 
refinements have appeared since, and the latest 
version, mapping some 1200 genes, was edited in 
1992 [7]. Both meiotic and mitotic approaches have 
been developed to map yeast genes. The life cycle of 
S. cerevisiae normally alternates between diplophase 
and haplophase. Both ploidies can exist as stable 
cultures. In heterothallic strains, haploid cells are of 
two mating types, a and α. Mating of a and α cells 
results in a/α diploids that are unable to mate but 
can undergo meiosis. The four haploid products 
resulting from meiosis of a diploid cell are contained 
within the wall of the mother cell (the ascus). 
Digestion of the ascus and separation of the spores 
by micromanipulation yield the four haploid meio- 
tic products. Analysis of the segregation patterns of 
different heterozygous markers among the four 
spores constitutes tetrad analysis and reveals the 
linkage between two genes (or between a gene and 
its centromere) [12]. On the whole, genetic distance 
in yeast appears to be remarkably proportional to 
physical distance, with a global average of 3 kb cM⁻¹. 
Deviations from this rule and results from direct 
comparisons of the genetic and physical maps will 
be discussed below. 
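
As an illustration of how tetrad data are converted into a map distance, the following Python sketch applies Perkins' formula to counts of parental ditype (PD), non-parental ditype (NPD) and tetratype (TT) asci. The formula is a standard relation for tetrad analysis and is not given in the text above. The tetrad counts are invented for the example, and the conversion to kilobases simply applies the global average of about 3 kb per cM quoted above.

```python
KB_PER_CM = 3.0  # global average quoted in the text


def perkins_map_distance(pd, npd, tt):
    """Map distance in cM from tetrad classes, using Perkins' formula.

    pd, npd, tt are the numbers of parental ditype, non-parental ditype
    and tetratype asci. The formula is reliable only for fairly short
    distances, where multiple crossovers are rare.
    """
    total = pd + npd + tt
    if total == 0:
        raise ValueError("no tetrads scored")
    return 100.0 * (tt + 6 * npd) / (2 * total)


# Hypothetical counts for two linked markers (PD greatly exceeds NPD;
# unlinked markers would give roughly equal PD and NPD).
distance_cm = perkins_map_distance(pd=70, npd=2, tt=28)
approx_kb = distance_cm * KB_PER_CM

print(f"genetic distance: {distance_cm:.1f} cM")
print(f"rough physical distance: {approx_kb:.0f} kb")
```

The physical estimate is only as good as the assumed average; as noted above, individual regions deviate from it.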


30.2.4 Manipulations in yeast 


Yeast has a generation time of ~80 min and mass 
production of cells is easy. Simple protocols are 
available for the isolation of high molecular weight 
DNA, rDNA, mRNA, and tRNA. It is possible to 
isolate intact nuclei or cell organelles such as 
intact mitochondria (maintaining respiratory com- 
petence). 

High efficiency transformation of yeast cells is 
achieved, for example, by the lithium acetate pro- 
cedure [13] or by electroporation. A large variety of 
vectors have been designed to introduce and to 
maintain or express recombinant DNA in yeast cells 
(see, for example, refs 4 and 14). Furthermore, a large 
number of yeast strains carrying auxotrophic mark- 
ers, drug resistance markers or defined mutations 
are available. Culture collections are maintained, for 
example, at the Yeast Genetic Stock Center and the 
American Type Culture Collection (ATCC). In the 
near future, mutant strains with defined gene 
deletions together with clones carrying the cor- 
responding gene cassettes will emerge from the 
EUROFAN project (see Section 30.7). 

A comprehensive library of recombinant lambda 
clones constructed as part of an S. cerevisiae physical 
mapping project and grouped in contigs [15] is 
maintained and distributed by ATCC. Ordered 
cosmid libraries using different vectors were 
constructed during the yeast sequencing project (see, for 
example, refs 16-18). 

The ease of gene disruptions and single step gene 
replacements is unique to S. cerevisiae and offers an 
outstanding advantage for experimentation. Yeast 
genes can be functionally expressed when fused to 
the green fluorescent protein, thus allowing the 
localization of gene products in the living cell by 
fluorescence microscopy [19]. The yeast system has also 
proved invaluable for cloning and maintaining large 
segments of foreign DNA in yeast artificial 
chromosomes (YACs), which are extremely useful for other 
genome projects [20], and for detecting protein-protein 
interactions using the two-hybrid approach [21]. 


30.3 The yeast genome 
sequencing project 


30.3.1 Strategy 


The yeast sequencing project was initiated in 1989 
within the framework of the European Union 
biotechnology programs. It was based on a network 
approach in which 35 European laboratories were 
initially involved [22], and chromosome III, the 
first eukaryotic chromosome ever to be sequenced, 
was completed in 1992 [23]. In the following years, 
with many more laboratories engaged, the European 
network tackled the sequencing of further complete 
chromosomes. Soon after the project began, 
laboratories in other parts of the world joined it to 
sequence other chromosomes or parts thereof, 
resulting in a coordinated international enterprise 
[24]. Eventually, more than 600 scientists in Europe, 
North America and Japan became involved in this 
effort. Figure 30.1 shows 
how the tasks were distributed. The sequence of the 
entire yeast genome was completed in early 1996 
and released to public databanks in April 1996. 


30.3.2 Cloning and mapping procedures 


The sequencing of chromosome III started from a 
collection of overlapping plasmid or phage lambda 
clones that were distributed by the DNA coordinator 
to the contracting laboratories. Subsequently, 
cosmid libraries were constructed to aid large-scale 
sequencing (see, for example, refs 16-18 and 25). For 
yeast, cosmids turned out to be the most convenient 
tools in the construction and handling of genomic 
libraries, as 35-45 kb of DNA can be accommodated 
in a cosmid vector. Obvious advantages of cloning 
DNA segments in cosmids were: 

1 larger genes could be obtained on a single 
recombinant clone; 

2 several linked genes could be isolated together 
with their intergenic regions; 

3 fewer colonies had to be maintained and screened 
to isolate a clone of interest; 

4 cosmid clones turned out to be stable for many 
years under usual storage conditions. 

Additionally, the isolation of sequentially 
overlapping cosmid clones has facilitated physical 
linkage over the entire yeast chromosomes. 


Fig. 30.1 The yeast genome project. The scheme 
represents the yeast chromosomes in the order in which 
they would appear after separation by PFGE. The 
chromosomes are drawn to scale and coded according to 
the different collaborating programs (names are those of 
the coordinators) for determining the sequences of 
particular chromosomes or regions thereof. Large 
tandem arrays such as those containing the ribosomal 
DNA repeats (chromosome XII), CUP1 repeats 
(chromosome VIII), and PMR2 repeats (chromosome IV) 
are each represented in the sequence by only two repeat 
units (white boxes). The complete sequence (12 052 kb) is 
available in annotated database entries; a compilation of 
useful computer addresses is presented in the annex. 
References to publications of the single chromosomes: I 
[52]; II [28]; III [23]; VI [100]; VIII [18]; X [101]; XI [71]. 
Publications on further chromosomes (IV [102], V [103], 
VII [104], IX [105], XII [106], XIII [107], XIV [108], XV [109], 
XVI [110]) and a general overview [111] will appear 
cumulatively [112]. 


To construct a library with as complete coverage 
as possible with as few clones as possible, the cloned 
DNA fragments should be randomly distributed on 
the DNA. Under these conditions, the number of 
clones (N) in a library representing each genomic 
segment with a given probability (P) is 


N = ln(1 - P)/ln(1 - f) 


where f is the insert length expressed as a fraction of 
the genome size [26]. Assuming an average insert 
length of 35 kb, a cosmid library containing 4600 
random clones would represent the yeast genome at 
P = 99.99%, i.e. about 12 times the genome 
equivalent. The actual number of cosmid clones obtained 
by the usual procedures is very high (> 200 000 per 
microgram of DNA). 
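
The library-size relation N = ln(1 - P)/ln(1 - f) given above can be evaluated directly. The Python sketch below is a minimal illustration, assuming the 35 kb insert length from the text and a genome size of 12 800 kb (the 12.8 Mb quoted in Section 30.2.1). The function names are hypothetical. With these particular inputs the formula gives a somewhat smaller N than the 4600 clones quoted, so the quoted figure presumably reflects different inputs or a safety margin; the coverage of about 12 genome equivalents is reproduced.

```python
import math


def clones_needed(p, insert_kb, genome_kb):
    """Number of random clones N giving probability p that any given
    genomic segment is represented: N = ln(1 - p) / ln(1 - f)."""
    f = insert_kb / genome_kb          # insert as a fraction of the genome
    return math.ceil(math.log(1 - p) / math.log(1 - f))


def coverage_probability(n_clones, insert_kb, genome_kb):
    """Inverse relation: probability of representation for n_clones."""
    f = insert_kb / genome_kb
    return 1 - (1 - f) ** n_clones


def genome_equivalents(n_clones, insert_kb, genome_kb):
    """Total cloned DNA expressed in multiples of the genome size."""
    return n_clones * insert_kb / genome_kb


INSERT_KB = 35        # average cosmid insert, from the text
GENOME_KB = 12_800    # assumed genome size (12.8 Mb)

n_for_9999 = clones_needed(0.9999, INSERT_KB, GENOME_KB)
p_for_4600 = coverage_probability(4600, INSERT_KB, GENOME_KB)
equivalents = genome_equivalents(4600, INSERT_KB, GENOME_KB)

print(f"clones needed for P = 99.99%: {n_for_9999}")
print(f"P for 4600 clones: {p_for_4600:.6f}")
print(f"genome equivalents in 4600 clones: {equivalents:.1f}")
```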

This small number of clones was advantageous in 
setting up ordered yeast cosmid libraries or sorting 
out and mapping chromosome specific sublibraries. 
For example, a chromosome XI specific sublibrary 
composed of 138 clones had been sorted out from an 
unordered cosmid library by colony hybridization 
using DNA from chromosome XI purified by PFGE 
as a probe. The nested chromosomal fragmentation 
method [27] was then applied to rapid sorting of 
these clones. Finally, a set of some 30 overlapping 
cosmids was sufficient to build a contig of 
chromosome XI. Subsequently, this approach has been 
successfully applied to most of the other 
chromosomes sequenced in the yeast genome project. We 
have selected 3500 independent clones from a total 
yeast cosmid library to prepare the DNA of each 
single cosmid from minilysates of 5 ml cultures [16]. 
These samples were numbered and the 
corresponding cultures kept as glycerol stocks at -70°C. 
By chromosomal walking, 43 of these clones have 
been sorted out to cover yeast chromosome II [28]. 
For cosmid cloning of chromosome II DNA, we 
employed a vector that carries a yeast marker and 
therefore could be used in direct complementation 
experiments [16]. 

For sequencing chromosome VIII [18], partially 
overlapping phage lambda and cosmid clones were 
used, which were previously mapped for HindIII 
and EcoRI sites [29]. It may be noted that, by 
convention of all laboratories engaged in 
sequencing the yeast genome, the strain αS288C or isogenic 
derivatives thereof were chosen as the source of 
DNA, as these strains have been fairly well charac- 
terized and employed in many genetic analyses. 

High-resolution physical maps of the respective 
chromosomes were constructed by application of 
classical mapping methods (fingerprints, cross- 
hybridization) or by novel methods developed for 
this programme, such as site-specific chromosome 
fragmentation [27] or the high-resolution cross- 
hybridization matrix [30], to facilitate sequencing 
and assembly of the sequences. These techniques 
might be of interest for other genomes as well and, 
particularly, for mapping YAC inserts. 


30.3.3 Sequencing strategies, sequence assembly 
and quality control 


30.3.3.1 Sequencing strategies 

In the European network, chromosome-specific 
clones were distributed to the collaborating 
laboratories according to a scheme worked out by the 
DNA coordinators. Each contracting laboratory was 
free to apply sequencing strategies and techniques 
of its own provided that the sequences were entirely 
determined on both strands and unambiguous 
readings were obtained. Two principal approaches 
were used to prepare subclones for sequencing: 

1 generation of sublibraries by the use of a series 
of appropriate restriction enzymes or from nested 
deletions of appropriate subfragments made by 
exonuclease III; 

2 generation of shotgun libraries from whole 
cosmids or subcloned fragments by random 
shearing of the DNA. 

Sequencing by the Sanger technique (see Chapter 22) 
was done either manually, with [³⁵S]dATP labelling 
as the preferred method of monitoring, or by 
automated devices (on-line detection with 
fluorescence labelling or direct blotting 
electrophoresis system), following the various 
established protocols. Similar procedures were 
applied to the sequencing of the chromosomes 
contributed by the Sanger laboratory and the 
laboratories in North America, Canada and Japan. 
The American laboratories largely relied on 
machine-based sequencing. 


30.3.3.2 Sequencing telomeres 

The yeast chromosome telomeres presented a 
particular problem. Owing to their repetitive substructure 
and the lack of appropriate restriction sites, 
conventional cloning procedures were successful in only 
a few cases. For the most part, telomeres were physically 
mapped relative to the terminal-most cosmid 
inserts using the I-SceI chromosome fragmentation 
procedure [27]. The sequences were then 
determined from specific plasmid clones obtained by 
telomere trap cloning, an elegant strategy developed 
by E. Louis [31,32]. 


30.3.3.3 Sequence assembly and quality control 

Within the European network, all original sequences 
were submitted by the collaborating laboratories 
to the Martinsried Institute of Protein Sequences 
(MIPS), which acted as an informatics centre. They 
were kept in a data library, assembled into 
progressively growing contigs, and updated during the 
course of the project. In collaboration with the DNA 
coordinators, the final chromosome sequences were 
derived. Starting with chromosome XI, all sequences 
submitted by the collaborating laboratories were 
subjected to quality controls. ‘Verifications’ 
(amounting to a total of 450 kb) were achieved by anonymous 
resequencing of selected regions, either long 
fragments (total of 15-20% per chromosome) or short 
segments (total of 1-2% per chromosome) chosen 
from suspected or difficult zones, which were 
resequenced directly from cosmids using 
designated pairs of oligonucleotides as primers. 

Similarly, automated procedures were employed 
for sequence assembly in the other laboratories, 
based, for example, on the programme package 
developed at Cambridge (see, for example, ref. 33) 
or on the ACeDB programme developed for the C. 
elegans genome project [34]. In any case, correct 
assembly of the sequences was guaranteed by estab- 
lishing that the order of restriction sites predicted 
from the sequence was consistent with the physical 
maps of these sites that had been determined inde- 
pendently and care was taken to perform quality 
controls that would result in a high accuracy [6]. 
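
The consistency check described above, comparing the order of restriction sites predicted from the assembled sequence with an independently determined physical map, can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the project. It assumes only the standard recognition sequences of HindIII (AAGCTT) and EcoRI (GAATTC), and the toy sequence and 'mapped' order are invented for the example.

```python
# Recognition sequences of the two enzymes. Both are palindromic, so
# scanning one strand finds every site.
SITES = {"HindIII": "AAGCTT", "EcoRI": "GAATTC"}


def predicted_sites(seq, enzymes=SITES):
    """Return sorted (position, enzyme) pairs; positions are the 0-based
    start of each recognition sequence in seq."""
    seq = seq.upper()
    hits = []
    for name, site in enzymes.items():
        start = seq.find(site)
        while start != -1:
            hits.append((start, name))
            start = seq.find(site, start + 1)
    return sorted(hits)


def consistent_with_map(seq, mapped_order, enzymes=SITES):
    """True if the order of sites predicted from the sequence matches the
    order of sites on the physical map."""
    order = [name for _, name in predicted_sites(seq, enzymes)]
    return order == list(mapped_order)


# Hypothetical assembled sequence with three sites.
toy = "CC" + "GAATTC" + "CCC" + "AAGCTT" + "GGG" + "GAATTC" + "AA"

print(predicted_sites(toy))
print(consistent_with_map(toy, ["EcoRI", "HindIII", "EcoRI"]))
```

A fuller check would also compare the predicted fragment sizes with those measured on gels; only the site order is tested here.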

In spite of all precautions, determination of a 
sequence cannot be error-free. From theoretical 
considerations [6] taking all types of errors together, 
it follows that with an average sequence accuracy of 
99.9%, only a third of all yeast genes are properly 
described, whereas fidelity is brought to 85%, if a 
sequence accuracy of 99.99% is reached. The 
systematic sequencing programs with verifications 
resulted in a sequence accuracy of ~99.97% cor- 
responding to a gene accuracy of some 75%. In 
practice, care was taken to minimize frameshift 
errors, which represent about two thirds of all 
sequencing errors and will have the most deleter- 
ious effects on gene interpretation. 
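
The link between per-base accuracy and the fraction of correctly described genes can be made concrete with a deliberately simple model, sketched below in Python. It assumes that errors are independent and that every gene has the same length. This is not the calculation of ref. 6, which takes all types of errors together, so the numbers it produces differ from those quoted above. The gene length of 1500 bp is an assumption, roughly the mean ORF length implied by a 12.8 Mb genome that is about 70% coding with some 6000 ORFs.

```python
def error_free_gene_fraction(base_accuracy, gene_length_bp):
    """Probability that a gene of the given length contains no error,
    assuming independent errors at a uniform per-base rate."""
    return base_accuracy ** gene_length_bp


GENE_LENGTH_BP = 1500  # assumed uniform gene length (see lead-in)

fractions = {
    accuracy: error_free_gene_fraction(accuracy, GENE_LENGTH_BP)
    for accuracy in (0.999, 0.9997, 0.9999)
}

for accuracy, fraction in fractions.items():
    print(f"sequence accuracy {accuracy:.2%}: "
          f"{fraction:.0%} of genes error-free")
```

Even this crude model shows the steep dependence: a tenfold reduction in the error rate moves the error-free fraction from a small minority of genes to a large majority.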


30.3.3.4 Sequence analysis 
As data were submitted by the individual 
laboratories, and again once the complete sequences 
of the chromosomes were available, the sequences 
were analysed by various algorithms. They were 
interpreted using the following 
principles: 
1 all intron splice site/branch-point pairs detected 
by using specially defined patterns (ref. 35; Kleine, 
K. and Feldmann, H., unpublished) were listed; 
2 all ORFs containing at least 100 contiguous sense 
codons and not contained entirely in a longer ORF 
on either DNA strand were listed (this includes 
partially overlapping ORFs); 
3 the two lists were merged and all intron splice 
site/branch-point pairs occurring inside an ORF but 
in opposite orientation were disregarded; 
4 centromere and telomere regions, as well as tRNA 
genes and Ty elements or remnants thereof, were 
sought by comparison with previously charac- 
terized datasets of these elements (Kleine, K. and 
Feldmann, H., unpublished data) including the 
database entries provided in a continuously updated 
library of tRNAs and tRNA genes [36] as well 
as the program tRNA Scan [37]. 
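
Principle 2 above can be sketched in a few lines. The following is a minimal illustration, not the software used in the project: it assumes that an ORF runs from an ATG to the next in-frame stop codon, scans all six reading frames, and then discards any ORF lying entirely within a longer one on either strand. The function names and the coordinate convention (0-based, end-exclusive, reported on the forward strand) are choices made here.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq, min_codons):
    """Yield (start, end) of ATG...stop ORFs in the three frames of seq.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:  # sense codons before the stop
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=100):
    """Return sorted (start, end, strand) tuples in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [(s, e, "+") for s, e in orfs_on_strand(seq, min_codons)]
    # Map reverse-strand hits back onto forward-strand coordinates.
    found += [(n - e, n - s, "-")
              for s, e in orfs_on_strand(reverse_complement(seq), min_codons)]
    # Drop ORFs contained entirely in a longer ORF on either strand;
    # partially overlapping ORFs are kept, as in principle 2.
    kept = [o for o in found
            if not any(p != o and p[0] <= o[0] and o[1] <= p[1]
                       and (p[1] - p[0]) > (o[1] - o[0]) for p in found)]
    return sorted(kept)
```

Intron handling (principles 1 and 3) is deliberately left out; a spliced gene would appear here as two shorter ORFs or be missed altogether.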

In the European network, special software developed 
for the VAX at MIPS was used to locate 
and translate open reading frames (ORFEX and 
FINDORF), to retrieve noncoding intergenic 
sequences (ANTIORFEX), and to display various 
features of the sequence(s) on graphic devices 
(XCHROMO; an interactive graphics display 
program). 

Searches for similarity of proteins to entries in the 
databanks were performed by FASTA [38], BLAST 
[39], and FLASH [40], in combination with public 
protein sequence databases. Protein signatures were 
detected by using the PROSITE dictionary [41]. 
ORFs were considered to be homologues or to have 
probable functions when the alignments from these 
searches showed significant similarity and/or 
protein signatures were apparent. Compositional 
analyses of the chromosome (base composition, 
nucleotide pattern frequencies, GC profiles, ORF 
distribution profiles, etc.) were performed by using 
the X11 program package (C. Marck, unpublished). 
For calculations of CAI and GC content of ORFs the 
algorithm CODONS [42-44] was used. Compar- 
isons of the chromosome sequences with databank 
entries at MIPS were based on a new algorithm 
developed there [45]. Furthermore, particular nucleo- 
tide patterns were searched for, which will be men- 
tioned below. Basically, the same strategies were 
used by other laboratories to interpret their 
sequences, again combining well-established rout- 
ines with special software developed in these 
laboratories. 
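
The CAI and GC-content calculations mentioned above can be sketched as follows. This is not the CODONS program; it is a minimal illustration of the usual definition of the codon adaptation index: each codon receives a weight equal to its count divided by the count of the most used synonymous codon in a reference set of genes, and the CAI of an ORF is the geometric mean of these weights. The reference set is supplied by the caller, and codons never seen in it are simply skipped, which is a simplification made here.

```python
import math
from collections import Counter

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO))


def codons_of(orf):
    orf = orf.upper()
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def relative_adaptiveness(reference_orfs):
    """Weight of a codon = its count / count of the commonest synonymous codon."""
    counts = Counter(c for orf in reference_orfs for c in codons_of(orf))
    weights = {}
    for codon, aa in CODON_TABLE.items():
        best = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        if best and counts[codon]:
            weights[codon] = counts[codon] / best
    return weights


def cai(orf, weights):
    """Geometric mean of the weights; Met, Trp, stops and unseen codons are skipped."""
    logs = [math.log(weights[c]) for c in codons_of(orf)
            if c in weights and CODON_TABLE[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

In practice the reference set would consist of highly expressed genes, so that a high CAI indicates codon usage typical of strongly expressed genes.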


30.4 Life with 6000 genes 


30.4.1 The proteome: open reading frames 
and gene function 


The term proteome has been coined to describe the 
complete set of proteins synthesized by a living cell 
[46]. With the completion of the yeast genome 
sequence, for the first time, we can now define the 
proteome of a eukaryotic cell. 

The sizes of the majority of the ORFs in yeast range 
from 100 to more than 4000 codons (Fig. 30.2). 
Less than 1% of the ORFs are estimated to be below 
100 codons; the smallest mature peptides that have 
been characterized are the two mating pheromones. 

Comparison of the final sequence with public 
databases revealed that some 28.11% of the yeast 
ORFs correspond either to previously known 
protein-coding genes or to genes whose functions 
have been determined previously or during the 
course of the project. An estimated 6% of the total 
remain questionable ORFs. Thus, 66% of the total 
ORFs represent novel putative yeast genes. As far as 
we can extrapolate from transcriptional mapping of 
particular chromosomes, the majority of these ORFs 
should represent ‘real’ genes, though many of those 
appear to be transcribed at an extremely low level 
[47,48]. Of the total ORFs, 14.8% have homologues 
among gene products from yeast or other organisms 
whose functions are known, whereas another 14.4% 
of the total have recognizable motifs or weak 
homologies to genes for experimentally character- 
ized functions. The remaining 37.7% of the total 
ORFs have either homologues to ORFs of unknown 
function on other chromosomes (26.2% of the total) 
or no homologues in data libraries at all (10.5% of the 
total). Thus, ~2200 of the yeast genes have to be 
categorized as ‘genes of unknown function’, some- 
times called ‘orphans’ [6]. 

It is noteworthy that the number of yeast genes 
to which functions could be attributed through 
comparisons with the database entries from all 
organisms has not substantially increased on com- 
pletion of the sequence, although this information 
has grown almost exponentially in the past few years. A 
similar observation was made when the complete 
genomes of small prokaryotes, such as Haemophilus 
influenzae (1.8 Mb) [49], Mycoplasma genitalium 
(0.6 Mb) [50], and Methanococcus jannaschii (1.7 Mb) 
[51] were determined: a large proportion of the genes 
have no counterparts in other organisms. We are 
therefore probably dealing with a general phenomenon: 
many of the novel functions may require only 
transient or low-level transcription in an 
organism or may be primarily phylum specific. 

Now that the complete sequence of the yeast 
genome is available, it will be interesting to 
systematically compare all of the ORFs which can 
be classified according to the presence of known 
functional motifs (or protein signatures). A useful 
inventory list of the yeast proteins has been 
compiled by J.I. Garrels (http://quest7.proteome.com/YPDhome.html). 


30.4.2 Overlapping ORFs, pseudogenes and 
introns 


A few cases have been found where overlapping 
ORFs indeed exist and are expressed. In one 
particular case, it was even shown that expression of 
the two ORFs occurs at different stages of yeast 
growth. Another interesting question was how 
many pseudogenes might be present in the yeast 
genome. From earlier studies, it was anticipated 
that this number in yeast should be low compared 
with that in mammalian genomes. Generally, this 
assumption seems to hold true for most of the yeast 
chromosomes, but chromosome I turned out to be 
an exception [52]. Chromosome I is the smallest 
naturally occurring functional eukaryotic nuclear 
chromosome so far characterized. The central 165 kb 
resemble other yeast chromosomes in both high 
density and distribution of genes. In contrast, the 
remaining sequences flanking this DNA (the two 
ends of the chromosome) have a much lower gene 
density, are largely not transcribed, contain no 
essential genes for vegetative growth, and contain 
four apparent pseudogenes and a 15-kb redundant 
sequence. These terminally repetitive regions con- 
sist of a telomeric repeat, flanked by DNA closely 
related to FLO1, a yeast gene involved in cell 
flocculation and encoding a large serine/threonine- 
rich cell wall protein with internal repeats. The 
pseudogenes are related to known yeast genes but 
have internal stop codons. Extreme care has been 
taken in such cases to reconfirm the sequences of the 
regions in question by independent laboratories. 

Only a minor fraction of the yeast genes, around 
4% of the total, are predicted (or already experi- 
mentally shown) to be interrupted by introns. To 
date, only two cases have been encountered where 
two introns are present: the MAT locus on chro- 
mosome III and a ribosomal protein gene, RPL6A, 
on chromosome VII. In the latter, the second intron 
encodes a small RNA. Generally, the intron is 
located at the extreme 5’-end of the gene, some- 
times even preceding the coding region. Most 
intron-containing genes encode ribosomal proteins. 
The functional significance of the introns is by no 
means clear, and despite the low number of intron- 
containing genes yeast maintains a highly sophis- 
ticated and complex machinery for splicing. 


30.4.3 Putative membrane proteins, 
mitochondrial proteins 


The ALOM algorithm [53] can be applied to predict 
putative membrane spans. A first estimate for 
chromosome III revealed that some 38% of the ‘real’ 
genes may code for transmembrane proteins cont- 
aining 1-14 potential membrane transversions [54]. 
A similarly high figure (142 ORFs out of 410) was 
found for chromosome II [28]. Data obtained from 
other systematically sequenced yeast chromosomes 
suggest that this applies as a general rule in 
yeast [55,56]. Even though the algorithm may give a 
somewhat high estimate, possibly a third of yeast 
proteins have to be considered to be associated with 
membrane structures. 

Examination of the ORFs for the occurrence of 
putative mitochondrial target signal sequences is 
difficult due to the complex character of these 
signatures [57]; this can only be achieved by visual 
inspection. Since not all of the proteins participating 
in mitochondrial biogenesis are imported via 
particular signal sequences, the exact number of 
proteins involved in maintaining mitochondrial 
function in yeast remains unknown at present. A 
rough estimate is that some 8% of the yeast proteins 
may be involved in mitochondrial biogenesis. 


30.4.4 Other genetic entities 


In addition to the genes encoding proteins, we have 
obtained detailed information on the organization of 
the genes for tRNAs and other small RNAs, the 
yeast retrotransposons (termed Ty elements), as well 
as the telomeric and centromeric sequences. The 
genes for the ribosomal RNAs are clustered in some 
100 copies on the right arm of chromosome XII, 
whereas the multiple copies of tRNA genes are 
found scattered throughout the genome. 

Five different types of Ty elements, which exhibit 
substantial homology to retroviruses and retrotrans- 
posons from plants and animals, are present in the 
yeast genome: Ty1, Ty2, and Ty4 belong to the ‘copia’ 
class of retrotransposons, while Ty3 is a member of 
the ‘gypsy’ family (for a review, see ref. 58). αS288C 
contains 32 complete Ty1 and 13 Ty2 elements, 
whereas Ty4 is present at only three locations and 
Ty3 occurs in two copies. Ty5, found in chromosome 
III, appears to be a new class of yeast transposon. 
Like retroviruses, the Ty elements transpose through 
an RNA intermediate and by reverse transcription. 
Transposition rates are low, and the number of 
elements is kept fairly constant by balancing trans- 
position and excision events [59]. This is manifest 
from the presence of 268 long terminal repeats 
(LTRs) or remnants thereof that are footprints of 
previous transposition events. Due to the vagabond 
life-style of the retrotransposons, yeast strains differ 
with respect to the sometimes rather complex 
‘patterns’ formed by these elements resulting from 
multiple integrations and excisions. However, 
comparison of different yeast strains (see, for 
example, refs 60 and 61) and experimental data [62] 
revealed that spontaneous transposition events do 
not appear to occur randomly along the length of 
individual chromosomes but that the Ty elements 
are preferably integrated into the upstream regions 
of tRNA genes [63]. Since these regions do not 
contain any special DNA sequences, the region- 
specific integration of the Ty elements may be due to 
specific interactions of the Ty integrase(s) with the 
transcriptional complexes formed over the intra- 
genic promoter elements of the tRNA genes or 
triggered by positioned nucleosomes in the 5’ 
flanking regions of the tRNA genes (see, for example, 
ref. 64). In any case, the Ty integration machinery 
can detect regions of the genome that may represent 
‘safe havens’ for insertion, thus guaranteeing both 
survival of the host and the retroelement. 

Analysis of the sequenced yeast genome clearly 
substantiates the earlier observations of consider- 
able plasticity of the yeast genome around tRNA 
gene loci and the existence of ‘transposition hot- 
spots’ (see, for example, refs 60, 61 and 65). 


30.5 Genome architecture and 
gene organization 


30.5.1 Gene density and gene arrangement 


It is now well established that the gene density in all 
yeast chromosomes is rather similar. Excluding the 
ORFs contributed by the Ty elements, ORFs occupy, 
on average, 70% of the sequences. This leaves only 
limited space for the intergenic regions which can be 
thought to harbour the major regulatory elements 
involved in chromosome maintenance, DNA repli- 
cation and transcription. Regarding transcription 
of protein-coding genes, a variety of elements 
have been identified and characterized that are 
operative in transcriptional initiation, regulation 
and termination. Not all of the yeast genes are 
preceded by a canonical TATA box, and it remains 
an open question which type of AT-rich sequences or other 
elements can act as transcriptional initiation sites 
[66]. In some cases, terminator sequences have been 
defined, but no general consensus sequences can be 
deduced. The same holds true for polyadenylation 
sites and polyadenylation signal sequences. Where 
experimentally determined, it appears that there is 
a much larger variability to these sequences than 
in mammalian systems [67]. As in mammalian or 
plant systems, a number of regulatory cis-acting 
elements (upstream activating sequences; UAS) and the 
corresponding trans-activating factors have been 
experimentally characterized in yeast (for a review, 
see ref. 68). Also negative regulatory elements 
(upstream repressing sequences; URS) have been 
shown to control the expression of some genes. 
However, only in a few instances are precise ideas on 
the intimate interplay of the various regulatory com- 
ponents mediating gene expression beginning to 
emerge. The knowledge of the entire genome 
sequence, combined with the powerful genetic tools 
available for yeast, should now foster research along 
these lines. 


Generally, ORFs appear to be rather evenly 
distributed between the two strands of the individual 
chromosomes. In some chromosomes (e.g. I, II, VII), 
there is a slight excess of coding capacity on one of 
the strands, the significance of which is not known. 
Figure 30.3 presents a scheme of how the individual tran- 
scriptional units are organized along yeast chromo- 
somes. Three principal arrangements are possible: 

1 ‘head-to-tail’ orientation of two adjacent genes, so 
that transcription occurs in the same direction and 
the intergenic regions should carry a terminator for 
one gene and a promoter for the next one to follow; 
2 ‘head-to-head’ orientation, in which transcription 
of two genes is divergent from a common ‘promoter’ 
region; 
3 ‘tail-to-tail’ orientation, by which two genes share 
a ‘terminator’ region. 
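
The three arrangements listed above follow directly from the strands of neighbouring genes. The sketch below assumes a list of strand symbols ('+' for the top strand, '-' for the bottom strand) given in chromosomal order; the function names are illustrative and no real annotation is implied.

```python
from collections import Counter
from itertools import groupby


def arrangement(left_strand, right_strand):
    """Classify two adjacent genes, given in chromosomal order, by strand."""
    if left_strand == right_strand:
        return "head-to-tail"   # same direction: terminator, then promoter
    if left_strand == "-":
        return "head-to-head"   # divergent: common 'promoter' region
    return "tail-to-tail"       # convergent: common 'terminator' region


def arrangement_counts(strands):
    """Count the arrangement types over all adjacent gene pairs."""
    return Counter(arrangement(a, b) for a, b in zip(strands, strands[1:]))


def longest_same_strand_run(strands):
    """Length of the longest array of consecutive genes on one strand."""
    return max((len(list(group)) for _, group in groupby(strands)), default=0)
```

Applied to the ordered strand list of a whole chromosome, the counts give the relative frequency of each arrangement, and the run length identifies arrays of genes transcribed in the same direction.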

There is no predominance of one or the other type 
of gene arrangement, although arrays longer than 
eight genes that are transcriptionally orientated in 
the same direction can be found on several chromo- 
somes. The extreme seems to be a region from 
chromosome VIII, where 17 in a run of 18 ORFs are 
located on the ‘top’ strand. 

In the ‘head-to-tail’ arrangements, the intergenic 
regions between two consecutive ORFs sometimes 
are extremely short, raising the question of whether 
they are maintained as separate units or coupled for 
transcription and translation. There are cases in 
which different functions have been combined in 
one genetic unit but, to the best of our knowledge, 
polycistronic messages have not been observed in 
yeast to date. 

Fig. 30.3 Organization of yeast genes along 
chromosomes. The average base composition of yeast 
DNA is 38.4% (G + C). As expected, the protein-coding 
regions have a higher GC content, on average (40.2%), 
than the noncoding regions (35.1%). In sliding windows, 
coding regions may be discriminated from intergenic 
regions, since ‘transitions’ in GC content are rather sharp 
at their borders. An almost symmetrical distribution of 
dinucleotide frequencies over the entire chromosome is 
apparent, whereas the base composition of ORFs shows a 
significant excess of homopurine pairs on the coding 
strand. Normally, coding regions are evenly distributed 
between the two strands. The average ORF size is 
1450 bp. The average sizes of interORF regions vary 
between 630 and 945 bp for different chromosomes; they 
are 618 bp on average for ‘divergent promoters’ (36.2% 
GC) and 326 bp for ‘convergent terminators’ (29.3% GC), 
while ‘promoter-terminator combinations’ (34.2% GC) 
are 517 bp in length on average. 

Initially, the intervals between divergently tran- 
scribed genes might be interpreted to mean that 
their expression is regulated in a concerted fashion 
involving the common promoter region. This, 
however, seems not to hold for the majority of the 
genes and might be a principle reserved for a few 
cases, in which these genes belong to the same 
regulatory pathway (e.g. GAL1/GAL10 [69]). By 
contrast, many examples are known in which a 
constitutively expressed gene shares its upstream 
sequences with that of a highly controlled gene. 
Given that most of the intergenic 
regions are relatively short (cf. Fig. 30.3), an intri- 
guing question arises: are regulatory 
elements confined to these sequences or could they 
also be present in coding sequences of neigh- 
bouring genes located upstream? Experimental 
data obtained for several genes involved in meiosis 
point to the latter possibility [70]. This would 
enable two different kinds of constraint to be super- 
imposed on sequences during evolution, one for 
maintaining function of coding sequences and one 
for preserving regulatory sequences. By employing 
catalogues of consensus sequences of the known 
regulatory elements (H. Feldmann, unpublished 
data), one can detect many sites within the 
intergenic regions which could be thought to be 
functional, but at the same time, such sites are found 
scattered throughout the coding regions as well. 
Although the functional significance of these 
sequences might strictly depend on the regional 
context, it is difficult, at the moment, to discriminate 
between functional and nonfunctional elements 
merely by inspection of the sequence. Only experi- 
mental approaches will answer this problem. 


30.5.2 Base composition and gene density 


Average base composition has been found to be 
symmetrical over the entire genome (the symmetry 
being even more apparent with dinucleotide fre- 
quencies), but this only reflects the almost equal 
numbers of ORFs encoded on each DNA strand of 
most of the yeast chromosomes, the base composi- 
tion of ORFs themselves showing a significant ex- 
cess of homopurine pairs on the coding strand [71]. 
Regional variations of base composition with 
similar amplitudes were first noted along chromo- 
some III [72], with major GC-rich peaks in the 
middle of each arm. Results from chromosome 
XI confirmed this finding, but owing to its larger 
size, revealed an almost regular periodicity of the 
GC content, with a succession of GC-rich and GC- 
poor segments. A most interesting observation was 
that the compositional periodicity correlates with 
local gene density, which reaches more than 85% in 
GC-rich regions, followed by segments of com- 
parably lower gene density (50-55%) in AT-rich 
regions [71]. Other chromosomes also show com- 
positional variation of similar range along their 
arms, with pericentromeric and subtelomeric regions 
being AT-rich, though spacing between GC-rich 
peaks is not always regular. In most cases, however, 
there is a broad correlation between high GC content 
and high gene density as is the case in more complex 
genomes in which isochores of composition are 
naturally much larger [73]. 
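
A GC profile of the kind discussed here can be computed with a sliding window. The sketch below is a minimal illustration; the window and step sizes are arbitrary defaults chosen here, since the values used for the published profiles are not stated in the text.

```python
def gc_profile(seq, window=1000, step=500):
    """Return (window midpoint, GC fraction) pairs along a sequence.

    window and step are in bp; both defaults are illustrative assumptions.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        profile.append((start + window // 2, gc))
    return profile
```

Plotting the second element of each pair against the first gives a profile in which GC-rich peaks and AT-rich valleys can be compared with the positions of ORFs, tRNA genes and Ty elements.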

The profiles obtained from the analysis of 
chromosome II [28] are exemplified in Fig. 30.4. GC- 
poor peaks coinciding with relatively low gene 
densities are located at the centromere (around 
coordinate 230) and at both sides of the centromere, 
with a periodicity of ~110 kb. These minima are 
more pronounced around coordinates 120, 340 and 
560, while they are less so at coordinates 450 and 670. 
Remarkably, most of the tRNA genes reside in GC- 
poor ‘valleys’, and the Ty elements eventually became 
integrated into these regions. When analysing 
chromosome II for the occurrence of simple repeats, 
putative regulatory signals, and potential ARS 
elements, we noticed that the latter ones were not 
found randomly distributed. In Fig.30.4, we have 
listed the location of 36 ARS elements which 
completely conform to the 11bp degenerate con- 
sensus sequence [74,75]. Several of these were found 
associated at their 3’ extensions with imperfect 
(1-2 mismatches) parallel and/or antiparallel ARS 
sequences or putative ABF1-binding sites, reminis- 
cent of the elements reported to be critical for repli- 
cation origins [76,77]. Remarkably, these patterns are 
found within the GC valleys, suggesting that 
functional replication origins might preferably be 
located in AT-rich regions. This phenomenon was 
also apparent from an analysis of chromosome XI 
and, more convincingly, when the distribution of 
functional replication origins mapped in chromo- 
some VI [78] or in 200kb of chromosome III [79] 
were compared to the GC profiles of these 
chromosomes. Functional ARS elements have yet to 
be defined for the remainder of chromosome III and 
the other yeast chromosomes. In this context, it 
would be interesting to see whether the origins of 
replication reveal a regular spacing [80] and whether 
these and the chromosomal centromeres might 
maintain specific interactions with the yeast nuclear 
scaffolding [81]. In all yeast chromosomes analysed 
thus far, ARS elements located in the subtelomeric 
regions are closely associated with specific OBF- 
binding sites [82,83]. 
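
A search for sequences conforming to the 11 bp ARS consensus can be sketched with a regular expression. The pattern used below, 5'-(A/T)TTTA(T/C)(A/G)TTT(A/T)-3', is the commonly cited form of the consensus; that it is identical to the form used in refs 74 and 75 is an assumption made here. Both strands are searched and overlapping matches are reported. Imperfect matches (1-2 mismatches), which the text also mentions, are not handled.

```python
import re

# Lookahead so that overlapping matches are all reported.
ACS = re.compile(r"(?=([AT]TTTA[TC][AG]TTT[AT]))")
ACS_LENGTH = 11
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_acs(seq):
    """Return sorted (start, strand) pairs for perfect consensus matches.

    start is the 0-based leftmost position on the forward strand.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start(), "+") for m in ACS.finditer(seq)]
    reverse = seq.translate(COMPLEMENT)[::-1]
    hits += [(n - m.start() - ACS_LENGTH, "-") for m in ACS.finditer(reverse)]
    return sorted(hits)
```

A perfect match to the consensus does not by itself define a functional replication origin, which is why the text distinguishes sequence matches from origins mapped experimentally.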
Although the fairly periodic variation of base 
composition is now evident for the yeast chromo- 
somes, its significance remains unclear. Several 
explanations for the compositional distribution and 
the location-dependent organization of individual 
genes have been offered. For example, the composi- 
tional periodicity of a yeast chromosome could 
reflect the evolutionary history of the chromosome, 
together with the folding of that chromosome, its 
attachment to the nuclear matrix or structural 
elements involved in chromosome segregation, or 
in the ‘homology search’ that precedes synapsis 
in the early meiotic prophase. Other possibilities 
could be tested experimentally. For example, tran- 
scription mapping of a whole chromosome could 
give a clue as to whether such rules may influence 
the expression of genes. Furthermore, long-range 
determination of DNase I sensitive sites may be 
used to find a possible correlation between compo- 
sitional periodicity and chromatin structure along a 
yeast chromosome. 


30.5.3 Telomeres 


The organization of the yeast telomeres (Fig. 30.5) 
has become clear from the work of E. Louis and his 
collaborators in conjunction with the chromosome 
sequences. All yeast chromosomes share charac- 
teristic telomeric and subtelomeric structures [32]. 
Telomeric (C1-3A) repeats, some 300 nucleotides 
in length, are found at all telomere ends. Thirty-one 
out of 32 chromosome ends contain the X core 
subtelomeric elements (400 bp), and 21/32 of the 
chromosome ends carry an additional Y’ element. 
There are two Y’ classes, 5.2 kb and 6.7 kb in length, 
both of which include an ORF for a putative RNA 
helicase of as yet unknown function. Y’ elements show 
a high degree of conservation but vary among 
different strains [84]. Experiments with the est1 (ever 
shortening telomeres) mutants, in which telomeric 
repeats are progressively lost, have shown that the 
senescence of these mutants can be rescued by a 
dramatic proliferation of Y’ elements [85]. Several 
additional functions have been suggested for these 
elements (for a review, see ref. 86), such as extension 
of telomere-induced heterochromatin, protection 
of nearby unique sequences from its effects, or a role in 
the positioning of chromosomes within the nucleus. 

Comparisons of the chromosome termini with 
one another revealed that, in addition to the common 
subtelomeric repeats, they share extended simi- 
larities in their subtelomeric regions: genetic redun- 
dancy is the rule at the ends of yeast chromosomes. 
The ‘duplicated’ regions contain copies of genes of 
known or predictable function as well as several 
ORFs the putative products of which exhibit high 
similarity. The functions of the latter remain unclear 
as no homologues of known function have been 
found in the databases. 

Fig. 30.5 Organization of yeast telomeres. The 
chromosome ends of the first three chromosomes to be 
completed are compared. Repetitious sequences of high 
similarity at the nucleotide level are indicated by the 
filled triangles. ORFs are represented by the arrows. The 
consensus telomere sequences are shown in black; their 
substructure [32] is indicated in the insert (not drawn to 
scale). 

For instance, the two terminal domains of chro- 
mosome III show considerable sequence homology 
(up to 18 kb), both to one another and to the terminal 
domains of other chromosomes (II, V, and XI). In 
chromosome VIII, extensive duplications of regions 
present on other chromosomes have been observed: 
30 kb near the right telomere is more than 90% 
identical to the similar region on the right arm of 
chromosome I. A smaller portion of this is also 
duplicated on the left arm of chromosome I. Shorter 
duplicated segments in the subtelomeric region of 
the left arm of chromosome VIII have been encoun- 
tered in the subtelomeric regions of chromosomes 
III and XI (showing 54-94% identity). 


30.5.4 Complex and simple repeats 


Overall, the yeast genome is remarkably poor in 
repeated sequences. The unique constellation of 
repetitive sequences at the two ends of chromosome 
I has already been pointed out. Approximately 30 kb 
in each subtelomeric region carry similar (but non- 
essential) genes and a 15-kb repeat. These features 
are consistent with the idea that these terminal 
regions represent the yeast equivalent to hetero- 
chromatin and the occurrence of this type of DNA 
suggests that its presence gives this chromosome the 
critical length required for proper stability and 
function. The 30-kb region can be removed from 
each end without affecting vegetative growth, 
although chromosome stability is considerably 
reduced. Most likely, these repeated regions contri- 
bute to chromosome I size polymorphisms which 
have been observed [52]. Besides the Ty elements, 
it is the rDNA on chromosome XII that most 
significantly contributes to repetitiveness. A cluster 
of some 15 tandem repeats (2 kb each) containing 
the CUP1 gene and contributing to polymorphic 
variation is found on chromosome VIII [18]. 

Repeated stretches of short oligonucleotides also exist. 
These include poly(A) or poly(T) tracts, alternating 
poly(AT) or poly(TG) tracts, and direct or inverted 
long repeats. Even short stretches of the simple 
sequence repeat (TG1-3), normally ‘sealing’ the 
chromosome ends, have been encountered internal 
to some chromosomes. Internal repeats of this type 
are probably relics of events during breakage and 
healing of chromosomes. 

By applying the program PYTHIA [87] to search 
for simple repeats (Chapter 25), we detected at least 
12 sets of regularly repeated trinucleotides along 
chromosome II representing repetitious codons for 
particular amino acids, thus forming homopeptide 
stretches. In some cases, even more complex amino 
acid patterns result (Table 30.1). A systematic study 
on the distribution and variability of trinucleotide 
repeats in the yeast genome is under way [88]. 
Perfect and imperfect repeats ranging from four to 
130 triplets were recognized, and the distribution of 
different triplet combinations was found to differ 
between ORFs and intergenic regions. Examination 
of various laboratory strains revealed polymorphic 
size variations for all perfect repeats, compared to an 
absence of variation for the imperfect ones. These 
findings are particularly interesting in view of the 
fact that several human genetic disorders are caused 
by trinucleotide expansion. The yeast system may 
now provide an experimental approach to study the 
mechanisms of their expansion. 

Table 30.1 Simple repeat sequences on chromosome II corresponding to amino acid homopolymer stretches. 

ORF      Gene   Amino acid repeat  Nucleotide repeat 
YBL084c  CDC27  Asn (>24)          AAT 
YBL081w         Asn                AAC 
                Ser 
YBL029w         Asn                ue 
                Ser & Pro          complex 
YBL011w         Glu                GAGGAA 
YBL007c  SLA1   QOQPOMMN           AC-rich 
YBR016w         Gln                CAG/CAA 
YBR040w         FLLI               CAAA 
YBR067c  TIP1   Ser                complex 
YBR112c  SSN6   Gln                CAA/CAG 
                Gln-Ala            CAAGCA 
YBR150c         Asn                TTA 
                Asp (>21)          GAT 
YBR289w  SNF5   Gln                CAACAG 
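
The search for regularly repeated trinucleotides can be illustrated with a short regular-expression scan. This is not the PYTHIA program; it is a minimal sketch that finds only perfect tandem repeats of a 3-bp unit on the given strand. The default threshold of four units follows the lower bound quoted in the text. Note that the repeat unit is reported in the phase in which the scan first meets it, so a (CAG)n tract preceded by a G is reported as (GCA)n.

```python
import re


def trinucleotide_repeats(seq, min_units=4):
    """Return (start, unit, copies) for perfect tandem trinucleotide repeats.

    start is 0-based; homopolymer runs such as AAAAAA are skipped.
    """
    # One captured unit followed by at least (min_units - 1) further copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) > 1:
            hits.append((match.start(), unit, len(match.group(0)) // 3))
    return hits
```

Comparing the copy numbers returned for the same locus in different strains would be one way to look for the size polymorphisms described above; imperfect repeats would require an approximate-matching approach instead.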


30.5.5 Comparison of genetic and physical maps 


The genetic map of S. cerevisiae [7] was of consider- 
able value to yeast molecular biologists before 
physical maps became available. In fact, we and 
others have used DNA probes from some known 
genes mapped to particular chromosomes for 
chromosomal walking. Ultimately, however, physical 
maps of all chromosomes have been constructed 
without reference to the genetic maps. 

Apart from local expansion or contraction of the 
genetic map, and the fact that the overall frequency 
of meiotic recombination increases with decreasing 
chromosome size, the order of the genes positioned 
on the chromosomes by genetic and physical map- 
ping agrees broadly. Thus, the comparison of the 
physical and genetic maps shows that most of the 
linkages give the correct gene order but that in many 
cases the relative distances derived from genetic 
mapping are imprecise. The obvious imprecisions of 
the genetic maps may be due to the fact that different 
yeast strains have been used in establishing the 
linkages. It is even possible that some strains used 
in genetic mapping experiments show inversions 
or translocations which then might contribute to 
discrepancies between physical and genetic maps. 
Clearly, the accuracy of genetic mapping will 
depend on the experimental approaches used. For 
example, a deviation between the genetic and the 
physical maps initially observed with chromosome 
XI [71] could be corrected by repeating the genetic 
mapping of a segment located next to the left 
telomere [89]. A more widespread phenomenon, 
however, that may lead to imprecisions of the 
genetic maps is strain polymorphism caused by 
the extended repetitive sequences or subtelomeric 
duplicated genes mentioned above, and particularly 
by the Ty elements. Altogether, the experience 
gained from the yeast genome project shows that 
genetic maps provide valuable information but that 
independent physical mapping and determination 
of the complete sequences are needed to unambigu- 
ously delineate all genes along chromosomes. At the 
same time, the differences found between various 
yeast strains demonstrate the need to use one par- 
ticular strain as a reference system. 
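
The comparison of genetic and physical maps described in this section amounts to two checks: whether the gene order agrees, and how the ratio of genetic to physical distance varies along the chromosome. The sketch below assumes two dictionaries mapping gene names to positions, in cM and in kb respectively; the function name and data layout are illustrative only and no real map data are implied.

```python
def compare_maps(genetic_cm, physical_kb):
    """Compare a genetic map (cM) with a physical map (kb).

    Genes are walked in physical order. Returns a list of adjacent pairs
    whose genetic order is reversed, and a dict of local recombination
    rates (cM per kb) for the remaining pairs.
    """
    genes = sorted(physical_kb, key=physical_kb.get)
    inverted, rates = [], {}
    for left, right in zip(genes, genes[1:]):
        d_cm = genetic_cm[right] - genetic_cm[left]
        d_kb = physical_kb[right] - physical_kb[left]
        if d_cm < 0:
            inverted.append((left, right))
        elif d_kb > 0:
            rates[(left, right)] = d_cm / d_kb
    return inverted, rates
```

Pairs reported as inverted would be candidates for remapping, as in the chromosome XI example, while strongly varying cM/kb values correspond to the local expansion or contraction of the genetic map noted above.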


30.6 Genome organization and 
evolutionary aspects 


30.6.1 Genetic redundancy in yeast 


A survey of previous sequence data and sequences 
obtained in the yeast sequencing project suggested 
that there is a considerable degree of internal genetic 
redundancy in the yeast genome. Although an 
estimate of sequence similarity (both at the nucleo- 
tide and the amino acid level) is now possible, it still 
remains difficult to correlate physical and functional 
redundancy, because even in yeast gene functions 
have been precisely defined only to a limited extent. 
Understanding the true nature of redundancy will 
help elucidate the biological role of every yeast 
gene. 


30.6.1.1 Duplicated genes in subtelomeric regions 
Classical examples of redundant genes in yeast are 
the MEL, SUC, MGL and MAL genes, which have 
been mentioned earlier. In fact, yeast strains differ 
by the presence or absence of particular sets of these 
genes. For example, three genes mapped on 
chromosome II of wild-type strains, MEL1, SUC3, 
and MGL2, are absent from the strain αS288C. A
comparison at the molecular level of αS288C with
brewer’s yeast strain C836 clearly shows that the
SUC genes are present on chromosome II of the
latter strain [90]. Regarding the genes involved in
carbohydrate metabolism, the presence of multiple
gene copies could be attributed to selective pressure
induced by human domestication, as it appears that
they are largely dispensable in laboratory strains
(such as αS288C), which are no longer used in
fermentation processes. Nonhomologous
recombination processes may account for the
duplication of these and other genes residing in
subtelomeric regions (see, for example, ref. 91),
reflecting the
dynamic structure of yeast telomeres in general [79]. 
We have already mentioned the fact that the 
subtelomeric regions of several yeast chromosomes 
share highly conserved segments, in some instances 
up to 30kb, which carry duplicated genes the 
functions of which are largely unknown. 


30.6.1.2 Duplicated genes in internal
chromosome regions

Analyses of individual chromosomes reveal
a great variety of genes that are duplicated
elsewhere in the genome. Before complete chromosome
sequences became available, many genes
were already known to occur in two or more identical, or
nearly identical, copies located on different chromo- 
somes, such as the histone genes, ribosomal protein 
genes, genes for ATP/ADP carriers, for enzymes 
of the glycolytic pathway, for sugar and amino 
acid transporters, and for many other proteins. 
Numerous examples can now be added when the 
completed chromosomes are searched for similarity 
at the nucleotide as well as at the protein level. 
These include dispersed families with related but 
nonidentical genes scattered singly over many 
chromosomes. The largest such family comprises 
the 23 PAU genes which specify the so-called 
seripauperines [92], a set of almost identical serine- 
poor proteins of unknown function. The PAU genes 
reside in the subtelomeric regions. Clustered gene 
families are less common, but a large family of this 
type occurs on chromosome I where six related 
genes encode a set of membrane proteins of
unknown function [92]. Another 10 members of this
family occur on five additional chromosomes; some
are clustered, others are scattered singly, and still
others are located in subtelomeric regions.


30.6.1.3 Duplicated genes in clusters 
Remarkably, duplicated genes have also been found 
in clusters. There are at least three examples of this 
kind in chromosome II [28]. Another case is a cluster
of three hexose transporter genes on chromosome
VIII [18], which appear to be the result of a less
recent gene duplication. Unusual cases of gene
duplication are represented by a large clustered
(tandem) gene family of membrane proteins on
chromosome I, and a large cluster on chromosome
VIII near CUP1. The CUP1 gene, which encodes copper
metallothionein, is contained in a 2-kb repeat that
also includes an ORF of unknown function. The
repeated region has been estimated to span 30 kb in
strain αS288C, which could encompass 15 repeats,
but the number of repeats varies among yeast strains.
However, in these and other cases, the duplicated 
sequences are confined to nearly the entire coding 
region of these genes and do not extend into the 
intergenic regions. Thus, the corresponding gene 
products share high similarity in terms of amino 
acid sequence or sometimes are even identical and 
therefore may be functionally redundant. However, 
as suggested by sequence differences within the 
promoter regions, gene expression should vary 
according to the nature of the regulatory elements or 
other (regulatory) constraints. It may well be that 
one gene copy is highly expressed while another one 
is weakly expressed. Turning on or off expression of 
a particular copy within a gene family may depend 
on the differentiated status of the cell (such as 
mating type, sporulation, etc.). Biochemical studies 
also revealed that, in particular cases, ‘redundant’ 
proteins can substitute for each other, thus accounting
for the fact that a large portion of single gene 
disruptions in yeast do not impair growth or cause 
‘abnormal’ phenotypes. This does not imply, 
however, that these ‘redundant’ genes were a priori 
dispensable. Rather they may have arisen through 
the need for yeast cells to adapt to particular 
environmental conditions. These notions are of 
practical importance when carrying out and inter- 
preting gene disruption experiments. 


30.6.1.4 Cluster homology regions 

An even more surprising phenomenon became 
apparent when the sequences of complete chromo- 
somes were compared to each other, revealing that 
there are large chromosome segments in which 
homologous genes are arranged in the same order, 
with the same relative transcriptional orientations, 
on two or more chromosomes. The occurrence of
such cluster homology regions (CHRs) is now
manifest for a great deal of the yeast genome and
might account for some 30-40% of total redundancy
[5,93].

Chromosomes II and IV share the longest CHR, 
comprising a pair of pericentric regions of 170 and 
120 kb, respectively, that share 18 pairs of
homologous genes (13 ORFs and 5 tRNA genes). The
genome has continued to evolve since this ancient
duplication occurred: genes have been inserted or
deleted, and Ty elements and introns have
been lost and gained between the two sets of
sequences. In all, at least 10 CHRs (shared with
chromosomes II, V, VIII, XII, and XIII) can be
recognized on chromosome IV. Remarkably, the 
entire chromosome XIV can be subdivided into 
several segments that are found duplicated on other 
chromosomes. 

To analyse the extent and pattern of redundancy
in the yeast genome, a potent data structure, the
HPT, has been developed at MIPS, allowing an
all-against-all comparison of fixed-size blocks of
nucleotides, the results of which can be visualized
by a graphical interface showing similarities both at
the nucleotide and the protein level (data are
obtainable from K. Heumann & W. Mewes at
http://www.mips.biochem.mpg.de/mips/yeast).
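
As a rough illustration of what an all-against-all comparison of fixed-size nucleotide blocks involves, the following Python sketch indexes every block of a chosen length and reports those found at more than one site, on either strand. It is not the HPT developed at MIPS, whose internals are not described here; the function names, the block size and the toy sequences are assumptions made purely for illustration.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(block):
    """Return the reverse complement of an A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(block))


def shared_blocks(sequences, block_size=12):
    """Index all fixed-size blocks and keep those seen at more than one site.

    sequences: dict mapping a (hypothetical) chromosome name to its sequence.
    Returns {canonical_block: [(name, position), ...]}.
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        seq = seq.upper()
        for pos in range(len(seq) - block_size + 1):
            block = seq[pos:pos + block_size]
            if not set(block) <= set("ACGT"):
                continue  # skip blocks containing ambiguous bases
            # Store each block under one canonical form so that a copy on
            # the opposite strand lands in the same bucket.
            key = min(block, reverse_complement(block))
            index[key].append((name, pos))
    return {block: hits for block, hits in index.items() if len(hits) > 1}


# Toy example: the 7-base block GATTACA occurs in both sequences.
toy = {"x": "GATTACA", "y": "CCGATTACACC"}
repeats = shared_blocks(toy, block_size=7)
```

Exact block matching of this kind finds only identical segments; diverged duplications, such as the ancient ones discussed in this section, would require a comparison that tolerates mismatches or works at the protein level.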


30.6.1.5 Redundancy and gene organization 
In all, we can imagine two ways in which dupli- 
cations may have arisen. First, some of the dupli- 
cated genes could represent processed genes that 
were inserted into the genome relatively recently,
a view which is consistent with the conservation
of sequence only in the coding regions. However, 
all of these cases would appear to be created by 
integration of full-length complementary DNAs, 
because none appears to be a pseudogene, and this is 
unexpected in this model. In addition, some of the 
homologous gene pairs include introns in both 
genes, which suggests that these genes at least were
not duplicated by this mechanism. Second, the 
clustering of duplicated genes and the occurrence of 
extended regions of similarity compel us to consider 
the idea that entire genomic regions were dupli- 
cated. Several of these duplication events would 
appear to be ancient, because the DNA sequence has 
clearly diverged outside the coding regions; more- 
over, such clusters even share a number of tRNA 
genes both in the same location and orientation. 
However, duplications may occur at any time 
during evolution (see reference to Heumann and 
Mewes, above). 

An interesting problem intimately related to 
evolution is the origin of the present organizational
pattern of genes. Could we, for example, find any
criteria, be it structural or functional in nature, that 
govern the regional arrangement of particular genes? 
In other words, is there an ‘ordered grouping’ 
of genes along the yeast chromosomes or are we 
left with a random succession of most of the genes? 
Presently, we have few clues to answer these 
questions. In some respects, however, we do have 
indications for a particular location or grouping of 
genes, and some examples have already been 
mentioned above. In several cases, highly expressed 
genes are found associated with ARS elements, so 
that one could speculate that replication and
efficient transcription are intimately coupled. In
chromosome XI, it appears that highly expressed 
genes occur in ‘clusters’ within preferred regions 
(B. Dujon, personal communication). Clearly, the 
MAL and SUC loci and the GAL locus represent
examples in which functionally related genes
involved in a particular metabolic pathway are 
closely associated with each other. 


30.6.2 Sequence variation among yeast strains 


The question of the extent to which yeast strains
differ in their genetic content has already been
touched on implicitly. We have discussed a number
of features that contribute to polymorphisms in 
different yeast strains: (i) variable number of gene 
copies from repeated gene families; (ii) individual 
patterns caused by the presence or absence of 
particular Ty elements; and (iii) plasticity of the 
chromosome ends. In all these cases, polymorphism
also becomes manifest through length differences
between corresponding chromosomes. In addition, 
excisions or inversions of particular gene regions 
have been observed to give rise to polymorphisms. 
Chromosome breakage has been found to occur in 
yeast, resulting in karyotypes deviating from the 
‘normal’ picture. However, sequence variations 
within the coding regions of individual genes seem 
to be rare, as far as we can tell from comparisons of 
the homologous sequences obtained from different 
strains. 


30.6.3 Other genomes 


30.6.3.1 The human–yeast connection

The availability of the complete yeast genome 
sequence not only provides further insight into 
genome organization and evolution in yeast but 
extends the catalogue of novel genes detected in this 
organism [93]. Many of these may be of particular 
value to yeast molecular biologists only, but of
general interest may be those that are homologues
of genes that perform differentiated functions in
multicellular organisms or that might be of
relevance to malignancy. Although the roles of these
genes have still to be clarified, yeast may offer a 
useful experimental system to identify their func- 
tion. On the other hand, the wealth of information to 
be expected clearly demands that new routes are 
explored to investigate the functions of novel genes. 

By comparing the catalogue of human sequences 
available in the databases with the ORFs on the 
completed yeast chromosomes at the amino acid 
level it is estimated that more than 30% of the yeast 
genes have homologues among the human genes. 
As expected, most of the genes of known function
categorized in this way represent basic functions in
both organisms. More similarities become apparent
when expressed sequence tags (ESTs) are included
in the analysis. Undoubtedly, the most compelling
protagonists among these homologues are yeast
genes that bear substantial similarity to human
‘disease genes’. Recently, a comparative study along
these lines has been published (obtainable from
http://www.ncbi.nlm.nih.gov/XREFdb/). Table 30.2
summarizes these findings.


30.6.3.2 Other model organisms 

Prior to the release of the complete yeast genome 
sequence, two complete bacterial genomes had been 
published [49,50]; another prokaryotic genome was 
released recently [51]. The sequences of several 
further bacterial genomes have apparently been 
completed and the sequences of a number of 
bacteria, mostly extremophiles, are under way. The 
genome sequence of E. coli is now completed (see 
Chapter 31) and Bacillus subtilis will be completed 
soon (reviewed in ref. 5). The genome sequences of 
the next two eukaryotic genomes, those of Schizosac- 
charomyces pombe and C. elegans (see Chapter 29), are 
within our reach; the systematic sequencing of 
larger model genomes, most notably Drosophila 
melanogaster (see Chapter 28) and Arabidopsis thaliana 
(see Chapter 33), has now been tackled. Undoubtedly,
the information accumulating from these
projects will provide an important foundation to
learn more about the extent to which various
processes are conserved among organisms in different
lineages [95].


Table 30.2 Human disease genes with similarity to yeast genes. The positionally cloned genes are listed in order of
decreasing statistical significance of the best match in the databanks [100].

Human disease                            Human gene  Yeast gene  Yeast protein description
Hereditary non-polyposis colon cancer    MSH2        MSH2        DNA mismatch repair enzyme
Hereditary non-polyposis colon cancer    MLH1        MLH1        DNA mismatch repair enzyme
Cystic fibrosis                          CFTR        YCF1        Cadmium resistance protein
Glycerol kinase deficiency               GK          GUT1        Glycerol kinase
Bloom syndrome                           BLM         SGS1        Mismatch repair enzyme
Adrenoleukodystrophy, X-linked           ALD         PAL1        Phenyl ammonia lyase
Ataxia telangiectasia                    ATM         TEL1        Telomere associated gene, chr. II
Amyotrophic lateral sclerosis            SOD1        SOD1        Superoxide dismutase
Myotonic dystrophy                       DM          YPK1        cAMP-dependent protein kinase
Lowe syndrome                            OCRL        YIL002      Putative IPP-5-phosphatase
Neurofibromatosis, type 1                NF1         IRA2        Inhibitory regulator of ras-cAMP, chr. II
Choroideremia                            CHM         GDI         GDP dissociation inhibitor
Diastrophic dysplasia                    DTD         SUL1        Sulphate transport protein
Lissencephaly                            LIS1        MET30       Methionine pathway factor
Thomsen disease                          CLC1        GEF1        Chloride channel protein
Wilms’ tumour                            WT1         FZF1        Sulphite resistance protein
Achondroplasia                           FGFR3       IPL1        Protein kinase
Menkes’ disease                          MNK         PCA1        Copper-transporting ATPase, chr. II
Multiple endocrine neoplasia 2A          RET         CDC15       Cell division control protein 15
Duchenne muscular dystrophy              DMD         MLP1        Myosin-like protein
Aniridia                                 PAX6        PHO2        Regulator in phosphate metabolism
Gonadal dysgenesis                       SRY         ROX1        Hypoxic function transcription repressor
Breast cancer, early onset               BRCA1       RAD18       DNA repair protein
Epidermolytic palmoplantar keratoderma   KRT9        MLP1        Myosin-like protein
Waardenburg syndrome                     PAX3        RPB1        RNA polymerase, subunit 10
Familial polyposis coli                  APC         AMYH        Adenylate cyclase
Neurofibromatosis, type 2                NF2         YNL161      Putative protein kinase
Retinoblastoma                           RB1         (Gye        Regulator of O2-dependent genes
Wiskott–Aldrich syndrome                 WASP        CLA4        Protein kinase
Xeroderma pigmentosum                    RAD27       YKL113      Nucleotide excision repair enzyme


30.7 Perspectives and conclusions 


30.7.1 Functional analysis 


From the beginning, it was evident to anyone 
engaged in the project that the determination of the 
entire sequence of the yeast genome should only 
be considered a prerequisite for functional studies 
of the many novel genes to be detected [96,97]. 
Following these considerations, a European project 
called EUROFAN (for European Functional Analysis
Network) has been established to undertake a
systematic functional analysis of
novel yeast genes [96]. Similar activities are under
way in Germany, Canada and Japan, and in the USA, 
initiatives have been started by the NIH. For 
EUROFAN, a first goal will be to systematically 
investigate the phenotypes resulting from disrup- 
tions (and possibly overexpression) of some 1000 
yeast genes of unknown function. A special set of 
yeast strains has been constructed for this purpose 
using a PCR-mediated gene replacement technique 
for the deletion of individual genes [98].
Concurrently, complete transcriptional maps of entire
chromosomes will be constructed. Likewise,
refined in silico analysis methods will be developed
to improve prediction of function (see,
for example, ref. 99). These data are then used as a
basis for intensified functional analyses: relevant 
genes or groups of genes that are suggested to be 
involved in particular functions are attributed to 
consortia of specialized laboratories for further 
exploitation. 


30.7.2 Outlook 


The wealth of fresh and biologically relevant
information collected from the yeast sequences and
the functional analyses has an impact on other
large-scale sequencing projects. The important
contribution of genome projects in determining
gene function has begun to emerge. Clearly, those
genes that are homologues of genes that perform
differentiated functions in multicellular organisms
or that are of relevance to malignancy will remain
of outstanding importance. Given the high
evolutionary conservation of a multitude of basal
functions from yeast to man and the experimental
advantages of the yeast system, it will be of great
benefit to combine these potentials to assist the
human and other genome projects.


Appendix


A number of informational resources on yeast are 
available on the Internet [103]. Valuable information
coupled with data libraries, routines for searches 
and data retrieval can be found in several home 
pages on the World Wide Web. Sequence data can 
also be retrieved from various databases by ftp or 
e-mail; likewise, these resources offer information 
and special services via e-mail addresses. 


Information on the yeast genome and related 
projects, search and query facilities 


http://www.mips.biochem.mpg.de/yeast/
http://www.embl-ebi.ac.uk
http://www.sanger.ac.uk/yeast/home.html
http://genome-www.stanford.edu/saccharomyces/
http://www.nig.ac.jp
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/XREFdb
http://quest7.proteome.com/YPDhome.html
http://expasy.hcuge.ch/cgi-bin/list?yeast.txt


E-mail addresses 


mewes@mips.embnet.org (information on the
European yeast project)

barrell@sanger.ac.uk (information on the yeast
project at Cambridge)

linder@urz.unibas.ch (information on ListA, a yeast
gene catalogue)

yeast-curator@genome.stanford.edu (information
on the American yeast project)

NetServ@ebi.ac.uk (general information on
databases)

DataLib@ebi.ac.uk (general information on
databases)


Data retrieval by ftp 


ftp://mips.embnet.org/yeast/
ftp://ftp.ebi.ac.uk/pub/databases/yeast
ftp://genome-ftp.stanford.edu/yeast/genome_seq


31.1 Introduction


There can be few practising molecular biologists 
who have not made use of Escherichia coli, its 
plasmids or bacteriophages, even if only as tools to 
aid in the analysis of genes originating from other 
species. Yet today E. coli, for several heady decades 
itself the focus of genetic and molecular analysis, no 
longer occupies centre stage: most recently trained 
molecular biologists have interests focused on one 
or other eukaryote. In consequence, the elegance of
E. coli’s molecular and genetic systems is a closed
book to many who might profit from their use. 
We will try to partly redress this state of affairs here. 
We hope to do this by providing, in addition to 
information on the availability of sequences, 
both an overview of E. coli and its genome, and 
an up-to-date compendium of information which 
will serve to familiarize the readers of this 
book with, or remind them of, both the simplicities 
and the complexities of this versatile micro- 
organism. 


31.2 Escherichia coli as 
a model system 


Escherichia coli offers its dedicated constituency the 
tantalizing possibility of fully defining the elements 
and processes that make up a simple free-living 
organism. The life cycle of E. coli is superficially 
uncomplex. The genome is replicated with a
concomitant doubling in mass and length of the cell
(fuelled by an intricate, but dissectable, interme- 
diary metabolism). These events culminate in the 
division of the parent cell into two daughter cells. 
If this were a satisfactory paradigm, E. coli could 
be fully defined by dissecting its machinery of 
replication, protein synthesis, division and energy 
metabolism. It has become abundantly clear, how- 
ever, that such a view is a gross oversimplification. 
Not only is E. coli capable of altering its metabolism 
and composition to cope with growth under a wide 
variety of conditions (free-living, life in the gut, 
aerobic and anaerobic states, conditions of plenty 
and starvation, etc.) but it is also supplied with a 
selection of inducible systems that enable it to 
respond to a panoply of possible stresses. This it 
does with alterations in composition, and with the 
production of groups of proteins whose function it 
is to repair and limit stress-induced damage. Two 
such proteins, active in DNA repair, have recently 
received attention because of their homology with a 
human protein associated with susceptibility to 
colon cancer [1,2], demonstrating, yet again, that E.
coli continues to be a useful model for previously
uncharacterized eukaryotic processes, including
some linked to inherited human disease.


Gene nomenclature


Escherichia coli genes are identified by four-letter italicized
names. The first three letters are a lower-case mnemonic
describing gene function or mutant phenotype. The fourth
distinguishes genes which share a mnemonic and is
assigned alphabetically as genes are discovered (when the
appropriate information was available, genes have been
named to denote the order of their activity in a biochemical
pathway). Early names relate to auxotrophies (arg, thr
mutants require the amino acids arginine or threonine for
growth), inability to ferment (lac mutants cannot use
lactose as a sole carbon source), or drug-resistance (amp
mutants are ampicillin resistant). Conditional mutants in
essential processes such as macromolecular synthesis are
named accordingly: thus dnaA–dnaX strains are defective
in DNA synthesis and fts mutants form filaments because
they have division defects. A new mutation with a
particular defect is given an allele number; dnaA52 will have
been assigned to the dnaA gene. If a gene has not been
assigned and only phenotypic information is available, a
phenotypic designation is properly used. Dna-52 could be
the phenotype of a strain with an otherwise
uncharacterized defect in DNA synthesis while dna-52 could
designate a mutation in one of several dna genes which has
not yet been assigned. By convention a mutant strain is
described with the mnemonic (i.e. a leu strain) and the
nonmutant parent, if necessary, as leu+.


Locations, such as transposon insertion sites, are assigned
names with a similar form; these all start with z. The
succeeding letters give positional information; a-j in the
second position indicate the 10-min interval in which the
site is located, the third position provides for subdivision
into minutes. Thus zbd indicates 13-14 min, and zej
49-50 min. A similar system has come into use to indicate
the positions of ORFs of unknown function identified
during sequencing. These are assigned names starting with
y with the following two letters giving positional
information. For these putative genes the fourth letter is also
assigned and here distinguishes multiple ORFs in the same
region.


Nomenclature is discussed at length each year in the first
(January) issue of the Journal of Bacteriology.
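
The positional naming scheme described in the gene nomenclature box can be made concrete with a short Python sketch. It assumes that the third letter, like the second, runs from a to j (one letter per minute within the 10-min interval); this is consistent with the two examples given (zbd and zej) but is not stated explicitly in the text. The function name is hypothetical.

```python
LETTERS = "abcdefghij"  # a-j: ten steps, as described for the second position


def map_interval(name):
    """Decode the map interval (in minutes) from a z.. or y.. style name.

    The second letter selects the 10-min interval; the third letter is
    assumed to select the minute within that interval.
    Returns a (start, end) tuple in minutes.
    """
    if len(name) < 3 or name[0] not in "zy":
        raise ValueError("expected a positional name starting with z or y")
    tens, units = name[1], name[2]
    if tens not in LETTERS or units not in LETTERS:
        raise ValueError("positional letters must lie in the range a-j")
    start = 10 * LETTERS.index(tens) + LETTERS.index(units)
    return (start, start + 1)


print(map_interval("zbd"))  # (13, 14), matching the example in the text
print(map_interval("zej"))  # (49, 50), matching the example in the text
```

A fourth letter, where present (as in the y names for ORFs), is ignored by this sketch because it carries no positional information.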

Although E. coli is less simple in its responses than 
once thought, it still leads the field as the
organism most likely to be the first fully understood. This
remains true despite the fact that the smaller 
genomes of several other bacterial species were fully 
sequenced earlier (see Section 31.9 for a fuller 
discussion). The reasons for this are twofold. First, 
E. coli is genetically tractable. Its genome of 4.7 
megabases (Mb) has at last been fully sequenced 
(completed in January 1997 [182]). Its well-defined
genetic systems make easy the identification of
genes and their products, and facilitate the analysis 
of the effects of specific mutations in, or deletions of, 
individual genes. New physiologic information can 
easily be collected using simple and well-tested 
methodologies. Second, there is an immense base of 
accumulated specific knowledge, its accessibility 
much increased by the publication, in 1987, of the 
encyclopaedic two-volume work, Escherichia coli 
and Salmonella typhimurium: Cellular and Molecular 
Biology [3]. This work has recently been revised and 
extended. The second edition includes not only 
genetic and physical maps, but also a gene–protein and
gene-function index to supplement the 155 
information-dense chapters describing all aspects of 
these related organisms. 

The earliest work with E. coli utilized a variety of 
different strains and isolates. However, the dis- 
covery of conjugation in the K-12 strain (isolated 
from a human patient in the 1920s) by Tatum and 
Lederberg in the 1940s [4] followed by the 
identification in the 1950s of derivatives able to 
transfer chromosomal markers with high frequency 
(Hfr donors) [5, 6] led to the adoption of E. coli K-12 
as the strain of preference for subsequent genetic 
studies. The original K-12 isolate was both lysogenic 
for the bacteriophage λ and harboured the
F-plasmid. In a series of efforts to isolate mutant
derivatives the original strains were subjected to 
numerous genetic insults, including repeated irra- 
diations with X-rays and ultraviolet, and treatment 
with assorted chemical mutagens. These treatments, 
coupled with continual passages and selections in 
many laboratories, have given rise to stocks of 
considerable diversity [7], even though all retain the 
specific modification and restriction system that 
defines K-12 strains [8]. Strains now described as 
‘wild-type’ K-12 strains are usually prototrophic 
and cured of lambda and F. However, their precise 
histories differ [7] and considerable variations in 
restriction patterns are not uncommon [9-13]. It is to 
be anticipated that some of the substantial differ- 
ences in genetic composition found amongst native 
E. coli isolates [14] will have been acquired second- 
arily by K-12 cultivars and may well account for 
at least some of the heterogeneities found in the 
sequence databases. 


31.3 The genome of Escherichia coli 


The basic genome of E. coli is a single circular 
chromosome (Fig. 31.1), 4.64 Mb in length (GenBank 
entry U00096, see ref. 15 for a brief overview, ref. 16 
for a symposium volume, ref. 17 for a
comprehensive handbook and ref. 170 for maps).


Fig. 31.1 Schematic representation of the E. coli circular
chromosome. The outer ring shows chromosome lengths
in kbp. The central ring shows the location of the
replication origin and directional termination sites. The
histogram shows the numbers of genes mapped to each
minute of the chromosome on Edition 9 of the linkage
map [170].


The genome may be augmented, in particular strains, by
integrated elements such as lysogenic
bacteriophages (the best known of these, λ, is 48.5 kb in
length, and was present in the original K-12 isolate, 
as described above) or plasmids, as in Hfr strains, 
in which the F-plasmid is integrated into the 
chromosome. Alternatively, or in addition, E. coli 
may harbour extrachromosomal genetic elements. 
Plasmids, autonomously replicating, circular DNA 
molecules (see refs 18-21 for reviews) range in size 
from about 5 kb (e.g. ColE1, the prototype of the
group that has yielded multicopy cloning vectors)
to about 100 kb (F). The larger plasmids are
characteristically self-transmissible, encoding elaborate
transfer systems which include hairlike cell-surface 
appendages, or pili, that promote conjugation. The 
smaller plasmids are adapted to parasitize the 
transfer systems of cohabiting larger plasmids and 
are thus also transmissible. Certain bacteriophages, 
notably P1 [22], adopt a plasmid form on lyso- 
genization, but rely on lysis and reinfection as a 
method of spreading to new hosts. The larger 
plasmids are generally maintained at 1-5 copies per 
host chromosome, the smaller at 15-20 copies. Thus 
an E. coli cell which is host to several types of 
plasmid can easily have its DNA content signifi- 
cantly increased in consequence. 
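
To give a feel for the size of this effect, the following Python sketch adds up the extra DNA carried by a cell using the plasmid sizes and copy numbers quoted above and the 4.64-Mb chromosome length given earlier in this section. The particular combination of plasmids (one F-sized plasmid at 5 copies and one ColE1-sized plasmid at 20 copies per chromosome) is an assumed worst case chosen for illustration, and the function name is hypothetical.

```python
CHROMOSOME_KB = 4640  # 4.64 Mb, as quoted for the E. coli chromosome


def extra_dna_fraction(plasmids, chromosome_kb=CHROMOSOME_KB):
    """Plasmid DNA per chromosome, as a fraction of chromosome length.

    plasmids: iterable of (size_kb, copies_per_chromosome) pairs.
    """
    extra_kb = sum(size_kb * copies for size_kb, copies in plasmids)
    return extra_kb / chromosome_kb


# Assumed example: a 100-kb plasmid at 5 copies plus a 5-kb plasmid at 20 copies.
example = [(100, 5), (5, 20)]
fraction = extra_dna_fraction(example)
print(f"{fraction:.1%}")  # prints 12.9%
```

Under these assumptions almost all of the additional DNA comes from the large plasmid, despite its lower copy number.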

The smaller plasmids are single replicons,
replicating from a particular origin, most often
unidirectionally. Replication of these plasmids requires
only host proteins. The larger plasmids tend to be 
chimaeras composed of more than one replicon; 
each intact replicon usually encodes a ‘Rep’ protein 
essential both to initiate replication, and to control 
replication frequency. Replication of plasmids may 
be uni- or bidirectional. In contrast, the chromosome 
itself is a single replicon, replicating bidirectionally 
[23,24] from an origin located at 84 map units; 
termination of replication occurs when the forks
meet at a position approximately opposite to the
origin. Termination (reviewed in refs 25 and 26) is
confined to the terminus region by the action of 
several orientated ter sites which prevent replication 
forks from proceeding through the terminus region 
back toward the origin. These sites act by binding a 
protein, Tus, which impedes the strand-separating 
action of the DnaB helicase. Their directionality 
arises from the fact that DnaB travels ahead of the 
replicating fork on one strand only. Replication is 
initiated once per cell division cycle when the 
initiation mass is attained; the details of replication 
and its control have been extensively reviewed 
[27-34]. 

The extended E. coli chromosome is 1000 times 
longer than the cell that contains it [35]. Prokaryotes 
lack a nuclear compartment; despite this the E. coli 
chromosome is not evenly dispersed throughout the 
cell but is centrally located within it [36]. It forms a 
discrete structure, termed the folded chromosome or 
nucleoid, which can be isolated intact from the cell 
(see refs 37-39 for further reviews). Nucleoid 
preparations lack the viscosity of unfolded DNA 
and have been observed to contain chromosomes 
with unstable beaded structures reminiscent of 
nucleosomes (see Fig. 31.2) [40]. Although E. coli
does possess a variety of small, basic, histone-like 
DNA-binding proteins [41-43], none of these has 
been found to be essential for cell survival [42,44], 
nor has any been convincingly demonstrated to 
fulfil a histone-like role in vivo. The way in which E. 
coli DNA is compacted within the cell thus remains 
far from fully understood. 
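
The length ratio quoted at the start of the paragraph above can be checked with a back-of-the-envelope calculation, sketched here in Python. The rise of 0.34 nm per base pair is the usual textbook value for B-form DNA, and the cell lengths of 1.5 and 2 µm are illustrative assumptions; neither figure is taken from this chapter.

```python
GENOME_BP = 4.64e6       # chromosome length quoted in Section 31.3
RISE_NM_PER_BP = 0.34    # assumed rise per base pair for B-form DNA


def contour_length_um(n_bp, rise_nm=RISE_NM_PER_BP):
    """Contour length of a linearly extended DNA molecule, in micrometres."""
    return n_bp * rise_nm / 1000.0


length_um = contour_length_um(GENOME_BP)  # roughly 1.6 mm of DNA
for cell_um in (1.5, 2.0):                # assumed cell lengths
    ratio = length_um / cell_um
    print(f"cell of {cell_um} um: chromosome is about {ratio:.0f} times longer")
```

With these assumptions the ratio comes out in the range of several hundred to about one thousand, in line with the figure cited.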

All circular DNAs in E. coli are maintained in an 
underwound state by the opposing actions of 
topoisomerases that introduce and remove turns in 
the helix [45]; thus circular DNAs are found to be 
supercoiled when examined in vitro. The intact 
chromosome (as recovered from the cell in folded 
form) has been estimated to be organized into 
50-100 separate domains of supercoiling (see refs 
37-39 for reviews). Nascent RNA and associated 
protein appear responsible for maintenance of these
domains but no specific protein has been found to be
essential. Supercoiling is likely to be responsible for 
some but not all of the compaction of the DNA. The 
chromosome contains several hundred similar, but 
not identical, intergenic sites (REP sequences, to be 
discussed in Section 31.6.4.2) of unknown function. 
These sites are able to bind DNA gyrase and DNA 
polymerase I (see ref. 46 for references) and, although 
it is attractive to suppose that REP sequences may 
have a role in maintaining chromosome structure, 
evidence is still lacking. The chromosome has also 
been reported to be associated with the membrane, 
again at 50-100 sites [47]. No specific chromosomal 
sequences (except possibly the replication origin 
[48]) have been identified as preferentially associ- 
ated with the membrane [49], and it is thus thought 
that the interaction between chromosome and 
membrane is a dynamic one, and probably does not 
have a role in maintaining chromosome structure. 


31.4 The physical map 


The first steps toward constructing a physical map 
were taken in 1975 when Clarke and Carbon [50] 
prepared a hybrid plasmid library by AT-tailing 
mechanically sheared coli DNA and ligating it into 
ColE1. They collected 2200 independent clones to 
form a bank which, it was hoped, would contain the 
entire genome. These clones have been separately 
numbered, but, even today, not all have been 
characterized [51]. It soon became clear that some 
sequences were absent from the Clarke-Carbon 
bank. This has been attributed to the fact that certain 
genes are lethal when cloned in high copy number. 
However, the bank has, none the less, proved very 
useful as a source of complementing cloned DNA 
and as the starting material for deriving the 
gene-protein index [52]. More recently, other cosmid 
banks have been constructed and characterized 
[53,13] with genome coverages of about 70% and 
95%, respectively. Table 31.1 lists some available 
clone banks of E. coli DNA. 
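
The expectation that a bank of this size would contain nearly the whole genome can be checked with a simple calculation. The Python sketch below treats the library as a random sample of fragments and estimates the chance that any given sequence is present. The genome size of 4700 kb is an assumed round figure, not a value taken from this section, and the function names are illustrative only. The model ignores the biological reasons, such as lethality at high copy number, for which some sequences are in practice missing.

```python
import math


def coverage_probability(n_clones: int, insert_kb: float, genome_kb: float) -> float:
    """Chance that a given sequence is present in a random library.

    Each clone is assumed to carry an independent random fraction
    f = insert_kb / genome_kb of the genome.
    """
    f = insert_kb / genome_kb
    return 1.0 - (1.0 - f) ** n_clones


def clones_needed(p: float, insert_kb: float, genome_kb: float) -> int:
    """Number of random clones needed to reach probability p of inclusion."""
    f = insert_kb / genome_kb
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - f))


GENOME_KB = 4700.0  # assumption: round figure for the E. coli genome

# A bank of 2200 clones with 15 kb inserts, as described in the text.
p_bank = coverage_probability(2200, 15.0, GENOME_KB)
n_for_99 = clones_needed(0.99, 15.0, GENOME_KB)
```

On these assumptions the probability for a bank of 2200 clones is very close to 1, so the absence of particular sequences points to selection against them rather than to sampling error.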

In 1987, Kohara constructed a λ library which 
included almost all the DNA of the K-12 strain 
W3110 [54]. Because λ does not need to be 
maintained as a high copy number plasmid, the 
prospects for cloning the entire genome were good, 
although in fact, eight small regions proved to be 
missing from the original clone set. A later cosmid 
library, based on a low copy number plasmid origin, 
filled in these gaps [53]. This demonstrated that, 
although perhaps poorly maintained when cloned 
in high copy number, these were clonable regions. The 
Kohara library was used to construct a restriction 
endonuclease map, measured in kilo-bases (kb), of 

Fig. 31.2 The E. coli chromosome visualized. (a) E. coli 
cells, photographed with phase contrast and fluorescence 
microscopy. Nucleoids have been condensed by 
treatment with chloramphenicol (200 µg ml⁻¹, 5 min) and 
stained with DAPI. Photograph courtesy of K. Begg. (b, c) 
Electron micrographs of sectioned whole cells. The small 
spheres are ribosomes. Part b shows conventionally fixed 
cells with sharply demarcated condensed chromatin with 
distinct fibres. In part c, cells have been cryofixed and 
freeze-fractured; the DNA is more diffuse, occupies a 
greater volume, and has a finer fibrillar structure than 
seen in part b. Note that the cell envelope also seems free 
from distortion. Photographs courtesy of E. Kellenberger. 
For further discussion, see [36]. (d) DNA with a 
nucleosome-like appearance emerging from a cell which 
has been briefly lysed in 1% Triton X-100 directly onto the 
electron microscope grid. Reproduced, with permission, 
from [40]. (e) DNA derived from a single lysed cell, 
possibly held together by remaining membrane. The 
DNA is in supercoiled loops; each loop may represent a 
domain of supercoiling. This image is copyrighted as 
‘Bluegenes #1’ 1983 with all rights reserved by 
DesignerGenes Posters Ltd, PO Box 100, Del Mar CA 
92014, USA, from which posters and T-shirts are 
available. With permission of R. Kavenoff. Scale bars 
represent 5 µm in part a; 1 µm in parts b-e. 

Table 31.1 Some available E. coli clone banks. 

Type                  Description                                 Source                        Reference
λ (Kohara)            Ordered overlapping collection of 476       Kohara (a)                    [54]
                      clones of W3110 DNA covering all but 7
                      small gaps
λ (Rose)              410 clones of MG1655 DNA; 7 recorded gaps,  E. coli genome project,       Personal communication,
                      some abutments                              University of Wisconsin (b)   see ref. 132
Phagemid library      W3110 DNA in SE6; NR1 origin allows         ATCC (purchase) (c)           [169]
                      maintenance as low copy number plasmid,
                      unordered
Clarke-Carbon         Sheared DNA in ColE1; 2200 individual       CGSC (d) (759 clones);        [50,51]
plasmid collection    clones (15 kb inserts), partial sets and    GSRC (e) (518 characterized,
                      characterized clones available              50% coverage, can supply
                                                                  whole set)
Cosmids (Tabata)      W3110 in pHC79; 40 kb inserts, 325 clones,  GSRC and 53                   [53]
                      70% coverage
Cosmids (Birkenbihl)  Strain BHB2600; 95% coverage, 570 clones,   R. Birkenbihl (f)             [13]
                      12 small gaps
Cosmids (Knott)       W3110 DNA; a low copy number cosmid,        Only 'gap-closing'            [55]
                      pOU61cos, was needed to close 3/8 gaps in   cosmids available
                      a conventional cosmid bank

(a) Dr Y. Kohara, National Institute of Genetics, Mishima, Shizuoka-ken 1111, Japan. Fax: +81-559-81-6826. 
(b) E. coli genome project, University of Wisconsin-Madison, Laboratory of Genetics, 445 Henry Mall, Madison WI 53706, USA. Fax: +608 263 745. E-mail: ecoli@genetics.wisc.edu. 
(c) American Type Culture Collection, 12301 Parklawn Drive, Rockville, MD 20852, USA. WWW: http://www.atcc.org/. 
(d) E. coli Genetic Stock Center, Department of Biology, 3550ML, Yale University, PO Box 208104, New Haven, CT 06520-8104, USA. Fax: +203 432 3852. E-mail: berlyn@cgsc.biology.yale.edu. Also see Section 31.5.1.3. 
(e) Dr A. Nishimura, Genetic Stock Research Center, address as note a. Fax: +81-559-81-6826. 
(f) Dr R. Birkenbihl, Department of Genetics, University of Köln, Zülpicher Strasse 47, 50674 Köln, Germany. 

Please note that these collections are preserved by individual researchers unequipped for large-scale distributions. 


the entire chromosome; the map shows the cutting 
sites for eight commonly used 6-bp cutters and the 
positions from which the insert of each of the clones 
in the Kohara clone library originates. The locations 
of certain genes which had been both genetically 
and physically mapped were used to align the 
physical with the genetic map. (At about the same 
time, a restriction map derived from pulsed field gel 
electrophoresis was published [9] showing sites for 
several rare cutters. Although they have proved of 
less use than the Kohara map in correlating the 
genetic and physical maps, maps of this sort have 
proved useful for the analysis of strain differences 
on a macro scale [9-13].) 

The publication of the Kohara map and the 
subsequent availability of the ordered miniset of 
λ-clones revolutionized genetic mapping (see below) 
and provided a framework on which genes could 
be placed to construct a true physical gene map. 
Sequenced regions long enough to contain several of 
the mapped restriction sites could be assigned to 
unique locations on the physical map, as could 
cloned genes for which a regional restriction map 
was available. Other genes were physically mapped 
by hybridization of cloned DNA to DNA prepared 
from miniset clones (see below). A section appeared 
in the Journal of Bacteriology from 1989 to 1993 
dedicated entirely to short reports of physical map 
locations. 

Several groups undertook the analysis of 
sequences in the databases or in the literature to 
compile maps, most notably those of Danchin [56] 
and Rudd [57]. Ecomap5, compiled by K. Rudd [58], 
is reproduced in a very useful two-volume lab- 
oratory manual and handbook of coli genetic data 
assembled by J. Miller [17]. Ecomap5 shows the 
locations of database entries on the physical map but 
does not attempt to delimit individual genes. A 
section from this map is included in Fig. 31.3. Later 
versions of the physical map have been compiled 
which do show the exact positions of individual 
genes; Ecomap6 is discussed, but not reproduced, in 
ref. 59; Ecomap7 appears in ref. 170. It is intended 
that the most recent Rudd version will be available 
electronically from the NCBI, as will be software for 
its use. A compilation of E. coli sequences extracted 
from the databases by M. Kroger and colleagues has 
now been upgraded into a map which can be 
interactively accessed on the World Wide Web. This 
is discussed in more detail in Section 31.8. 

Although this piecemeal placement of infor- 
mation on a developing physical map could be 
expected to result eventually in a complete map, the 
development of a complete map has been greatly 
accelerated by the systematic sequencing projects 
which will be described below. Analysis of the data 
from systematic sequencing allows all potential 
open reading frames to be identified, delimited, and, 
frequently, to be equated with previously identified 
and mapped genes which had either not been 
sequenced or not located on the physical map. An 
example of analysed sequence is shown in Fig. 31.3. 


31.5 Genetic mapping in 
Escherichia coli 


31.5.1 Classical methods: 
conjugation and transduction 


Classical methods of mapping will be defined here 
as those that do not require the use of restriction 
enzyme technology. Classical mapping in E. coli 
relies on two methods of gene transfer, conjugation 
and transduction. A recent volume of Methods in 
Enzymology describes these techniques and others to 
be described below; it includes protocols [60]. 


31.5.1.1 Mapping by conjugation 

Strains in which the F-plasmid is integrated into the 
chromosome are called Hfr strains. When mixed 
with strains that lack F (F⁻, or female, strains), Hfr 
strains will initiate the formation of mating pairs 
and transfer a single-stranded copy of the chromosome, 
in an orientated fashion, beginning at the 
site of insertion of the F-factor; this process is called 
conjugation. Mating pairs usually separate before 
transfer is complete, but when complete transfer 
does occur, it requires about 100 min. For this reason 
the coli chromosome is divided into 100 map units 
termed minutes. (Conjugational crosses using a 
collection of donors differing in the site of F 
integration and orientation of transfer supplied the 
initial evidence that the chromosome is circular 
[61].) 

When conjugation is used for mapping [62], a 
mutant recipient is usually crossed with a non- 
mutant donor; selection of progeny is accomplished 
by selecting for transfer of the desired character and 
counter-selecting against the donor, usually with a 
drug (streptomycin and nalidixic acid are popular) 
to which the recipient has been made resistant. If the 
site of insertion of F on the donor chromosome is 
known, the time after mating at which the selected 
allele is first transferred indicates its position on the 
donor chromosome. Reasonably accurate estimates 
of position require Hfrs that transfer the desired 
allele early, and thus mapping of an unknown gene, 
which might be anywhere, requires the use of sets of 
Hfrs with different origins of transfer. Sets of Hfrs 
are described in refs 17 and 62 and are available from 
the E. coli Genetic Stock Center (CGSC, see Table 31.1 
for address) and other sources. 
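
The arithmetic of time-of-entry mapping can be sketched as follows. This is a minimal illustration, assuming that transfer proceeds at one map minute per minute of mating (as the 100-minute map implies) and that any lag before transfer begins is supplied by the user. The function name, the `clockwise` flag and the `lag_min` parameter are hypothetical and do not come from any published protocol.

```python
def map_position(origin_min: float, clockwise: bool,
                 entry_time_min: float, lag_min: float = 0.0) -> float:
    """Estimate the map position (0-100 min) of a marker.

    origin_min     : map position of the Hfr origin of transfer
    clockwise      : True if markers are transferred in order of
                     increasing map position, False otherwise
    entry_time_min : time after mixing at which the marker first
                     appears in the recipients
    lag_min        : assumed delay before transfer starts
    """
    travelled = entry_time_min - lag_min
    if travelled < 0:
        raise ValueError("entry time is shorter than the assumed lag")
    step = travelled if clockwise else -travelled
    # The chromosome is circular, so positions wrap around at 100 min.
    return (origin_min + step) % 100.0
```

Because the estimate degrades as transfer time increases, the same calculation is normally repeated with several Hfrs and the result from the donor that transfers the marker earliest is preferred.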

More recently, sets of Hfrs have been developed in 
which a transposable element (Tn10), specifying 
tetracycline or kanamycin resistance, is transferred 
at about 20 min after the initiation of mating. When 
these Hfrs are used, drug resistance is selected after 
short mating periods. Transfer of the desired marker 
is scored amongst the drug-resistant progeny and, 
when transferred, can be presumed to be linked to 
the transposon or transferred earlier. Use of these 
sets allows approximate locations of new genes to be 
determined quickly. Such a set of Hfrs is available 
from the CGSC (Wanner set [63]) or from Dr C. Gross 


Fig. 31.3 (Opposite) Escherichia coli linkage and physical 
maps. (a) A segment from the E. coli linkage map 
published in 1996 [171]. Each gene indicated here has 
been placed by genetic techniques, sometimes 
augmented, in the case of operons, by physical analysis. 
The numbers below the map represent map minutes. The 
arrows indicate direction of transcription. (b) Part of this 
region as derived from the physical map, based on 
Kohara [54] but updated to show regions that have been 
sequenced. The restriction map has been corrected, 
where necessary, with information derived from 
sequencing. The scale above the map is chromosomal 
length in kbp, with map minutes shown immediately 
below the map. The upper set of lines below the map 
indicate the inserts from the Kohara set of λ-clones and 
the lower set sequence contributions to GenBank. The 
continuous line (ggt-ecoM) indicates that this sequence is 
part of the large contig compiled by the Wisconsin 
mapping project. Contributed by K. Rudd. (c) Part of this 
segment based on information from the Wisconsin 
sequencing project which appeared in ref. 76. The 
λ-clones indicated are those isolated as part of this project. 
Restriction sites: B, BamHI; G, BglI; R, EcoRI; V, EcoRV; H, 
HindIII; K, KpnI; S, PstI; P, PvuII. X> marks the position of 
Chi sites; all are orientated similarly; * marks DNA bend 
sites. Potential promoters and terminators are shown 
with putative transcripts. 

(Department of Stomatology, University of California, 
San Francisco, CA 941430, USA; Singer/Gross 
set [64]). 


31.5.1.2 Mapping by transduction 

In bacteriophage P1 lysates, about 1% of phage coats 
contain fragments of chromosomal DNA, 100 kb in 
length, which have been packaged in place of phage 
genomes [22]. These can be injected into recipient 
cells and will recombine with the recipient chromo- 
some to replace homologous DNA, in a process 
termed generalized transduction. Once a gene has 
been located approximately (i.e. by conjugation) it 
can be located more precisely by scoring cotransduc- 
tion with a possible neighbouring marker (which 
must be within 100 kb or 2 min to be cotransduced). 
The closer the pair of markers the greater is their 
frequency of cotransduction; cotransduction fre- 
quency can be used to calculate genetic distance by 
applying a function derived by Wu [65]. Sets of 
donors, each with a transposon located at a different 
known position, have simplified this form of 
mapping immensely. These donor sets are also 
available from C. Gross and the CGSC. The order of 
genes very close to one another, or of point mutations 
within genes, was classically determined by 
using three-point transductional crosses. Sequen- 
cing and other nonclassical approaches (to be dis- 
cussed further below) have now supplanted these 
methods for fine-structure mapping. 
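
The relation between cotransduction frequency and distance can be illustrated numerically. The form usually quoted for the Wu function is c = (1 - d/L)^3, where d is the distance between the markers and L is the length of the transducing fragment. That form is stated here from general knowledge rather than from this chapter, and the default of L = 2 min simply follows the figure given above for P1. The function names are illustrative.

```python
def cotransduction_frequency(d_min: float, fragment_min: float = 2.0) -> float:
    """Expected cotransduction frequency for markers d_min apart.

    Assumes the commonly quoted form c = (1 - d/L)**3, with L the
    length of the transducing fragment in map minutes.
    """
    if d_min < 0:
        raise ValueError("distance must be non-negative")
    if d_min >= fragment_min:
        return 0.0  # markers too far apart to share one fragment
    return (1.0 - d_min / fragment_min) ** 3


def distance_from_cotransduction(c: float, fragment_min: float = 2.0) -> float:
    """Invert the same relation: map distance from an observed frequency."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    return fragment_min * (1.0 - c ** (1.0 / 3.0))
```

The cubic dependence means that the frequency falls steeply with distance: on these assumptions, markers one minute apart are cotransduced in only about one transductant in eight.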


31.5.1.3 The genetic map 

An E. coli genetic map based on this type of study 
was first published in 1964 [66]. Subsequent editions 
of the map updated the coverage. Edition 9 includes 
about 1800 loci and is integrated with the physical 
map. The CGSC, now maintained by M. Berlyn at 
Yale University, has a collection of about 7000 strains 
which are supplied free of charge on request. Details 
of the strains in the collection can be browsed 
directly using the World Wide Web 
(http://cgsc.biology.yale.edu/top.html). Mutants for most genes 
that have been described are available, and strains 
with useful combinations of mutations can also be 
supplied. A less extensive collection is maintained 
in Japan; a catalogue of available strains can be 
obtained from Dr A. Nishimura (address in Table 
31.1). 


31.5.2 Methods dependent on the physical map: 
exploitation of clone banks 


The Kohara restriction map, the availability of 
several ordered genome libraries, and the ease with 
which custom libraries can be constructed, have 
revolutionized gene mapping methodologies. 
Several commonly used approaches for mapping 
new genes are described here (see also ref. 60). 


31.5.2.1 Conjugation followed by complementation 

A commonly used approach is to roughly map the 
mutation of interest by conjugation and then to use a 
selected set from an ordered clone library to achieve 
complementation or recombination. In addition to 
the Kohara set of λ-phages there are now available a 
second independent λ-phage set and, as mentioned 
above, several plasmid and cosmid sets with full or 
partial coverage of the genome (Table 31.1). The 
complementing sequences can be pinpointed by 
subcloning from a complementing clone. It should 
be noted that high copy number suppression of 
mutations is common in E. coli (in such cases 
overproduction of a second protein reverses the 
effects of the deficiency of the first, see ref. 68 for an 
example) and it is necessary to confirm complement- 
ation attributed to DNA cloned in high copy number 
vectors with recombinational tests. 


31.5.2.2 Direct identification of a complementing fragment 

The second method dispenses with classical map- 
ping entirely. A complementing clone can be sought 
in a genomic library made for the purpose (this need 
not be ordered) and the complementing fragment 
identified by comparing its physical map with that 
of the chromosome. This has been done by eye, but 
computer programs have now been devised to 
identify the chromosomal location of a particular 
restriction pattern [57,69]. Alternatively, DNA iden- 
tified as complementing can be hybridized [70] to a 
commercially available filter (Takara Shuzo Co., 
Kyoto, fax:(+8175) 2415199) which contains an 
ordered array of DNA derived from the Kohara 
clone set. Now that the full coli sequence is available, 
it is to be anticipated that the chromosomal origin of 
cloned DNA will regularly be determined by 
sequence comparison. 
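
The programs cited above are not reproduced here. The following toy sketch only conveys the idea of locating a clone by its restriction pattern. It assumes a single enzyme, a linear stretch of map given as a sorted list of site coordinates, and a fixed tolerance for errors in fragment size; all names and the tolerance value are hypothetical.

```python
def locate_fragment_pattern(site_positions_kb, fragment_sizes_kb, tolerance_kb=0.3):
    """Find where a run of restriction fragments fits on a site map.

    site_positions_kb : sorted coordinates (kb) of the enzyme's sites
                        on the chromosomal map
    fragment_sizes_kb : sizes of consecutive internal fragments of the
                        clone, in map order
    Returns the coordinates of the sites at which a matching run starts.
    """
    # Distances between neighbouring sites on the reference map.
    gaps = [b - a for a, b in zip(site_positions_kb, site_positions_kb[1:])]
    n = len(fragment_sizes_kb)
    hits = []
    for i in range(len(gaps) - n + 1):
        window = gaps[i:i + n]
        if all(abs(g - f) <= tolerance_kb
               for g, f in zip(window, fragment_sizes_kb)):
            hits.append(site_positions_kb[i])
    return hits
```

A practical program would combine the patterns for several enzymes, allow for the circular map and for the clone in either orientation, and rank near matches rather than accept or reject them outright.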


31.5.2.3 Mapping an insertion mutation without cloning 

A method has been described which exploits the fact 
that REP sequences (see Section 31.6.4.3) occur at 
frequent intervals [172]. Primers specific to the 
insertion are paired with generalized primers that 
will match most REP sequences and PCR is then 
used to amplify a fragment directly from chromo- 
somal DNA. The resulting product can be cloned if 
desired or sequenced directly without cloning. 


31.5.2.4 Reverse genetics 

Finally, a method which starts with a protein product 
rather than a mutant, and which is commonly 
used in eukaryotic systems, can be employed. To 
map by reverse genetics, the protein of interest is 
purified and N-terminal sequence obtained. The 
peptide sequence can be used to design a set of 
oligonucleotides, each of which encodes it, and one of 
which must be the native sequence. In the past such 
sets have been hybridized to a library filter to 
identify the clone carrying the encoding DNA. 
Subsequent subcloning, hybridization and sequencing 
have been used to assign the gene to an exact 
physical map location (for an example, see ref. 71). 
Of course, now that the complete sequence is 
available, oligonucleotide ‘probes’ can be used to 
directly match a gene product associated with a 
phenotype with a previously identified coding unit. 
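
The size of the oligonucleotide set needed for a given peptide follows directly from the degeneracy of the genetic code. The Python sketch below back-translates a short peptide with the standard code and lists every DNA sequence that could encode it. The function names and the example peptide in the note that follows are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G
# at each position (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)),
                       AMINO_ACIDS))

# Reverse lookup: amino acid (one-letter code) -> list of codons.
BACK_TABLE = {}
for codon, amino_acid in CODON_TABLE.items():
    BACK_TABLE.setdefault(amino_acid, []).append(codon)


def degeneracy(peptide: str) -> int:
    """Number of distinct DNA sequences that encode the peptide."""
    total = 1
    for amino_acid in peptide:
        total *= len(BACK_TABLE[amino_acid])
    return total


def candidate_oligos(peptide: str) -> list:
    """Every DNA sequence encoding the peptide; one is the native one."""
    choices = (BACK_TABLE[amino_acid] for amino_acid in peptide)
    return ["".join(codons) for codons in product(*choices)]
```

A stretch such as Met-Trp-Lys needs only two oligonucleotides, whereas Met-Leu-Ser needs 36, which is why regions rich in Met and Trp and poor in Leu, Ser and Arg are preferred when designing such sets.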


31.6 The Escherichia coli chromosome 


In this section we will summarize both information 
that has been available for some time and informa- 
tion derived from the analysis of data collected by 
the sequencing projects to be described in the final 
sections. The longest contig available in 1996, 
resulting from the combined systematic sequencing 
projects of the Madison, Wisconsin E. coli Genome 
Project (headed by F. Blattner) and a Japanese 
consortium, comprised a third of the genome. These 
groups have carefully analysed their sequence with 
regard not only to the identification and positioning 
of specific genes but with the goal of documenting 
global characteristics of gene and repetitive site 
arrangement. These analyses had been reported by 
1994 in six primary publications from Wisconsin 
(refs 72-77) and in two from Japan [78,79]. 
Information in this section, where not specifically 
referenced, has been quoted or derived from these 
publications. 


31.6.1 Arrangement of genes on the DNA 


The chromosome of E. coli, like those of yeast, but 
unlike those of higher eukaryotes, is densely packed 
with genes. The contiguous sequence analysed to 
1994 shows that about 85% of this DNA is likely to 
encode proteins, and another 4% to be transcribed to 
yield nonmessenger RNAs. Much of the remaining 
intergenic sequence is in very short stretches 
and probably consists mainly of sequences 
concerned with the control of transcription and 
translation. Although E. coli genes are commonly 
thought to be arranged in operons, in the regions 
where transcriptional units have been defined, only 
about two-thirds of the genes detected are likely to 
belong to transcriptional units containing two or 
more genes. The remainder have promoters of 
their own. Although there are numerous exceptions, 
cotranscribed genes tend to be functionally related. 

Analysis undertaken as part of the sequencing 
projects listed above has allowed the assignment of 
definite functions to between 40 and 45% of the open 
reading frames (ORFs) identified. The majority of 
these correspond to previously defined or sequenc- 
ed genes, but some functions could be assigned 
because of the near identity of new ORFs with those 
encoding better studied homologues in related 
organisms; these were corroborated, in most cases, 
with genetic evidence. A further 20-25% of trans- 
lated ORFs were found to have significant amino 
acid sequence similarity to existing database entries; 
these similarities could be used to predict function. 
The remaining 30% of ORFs remained functionally 
uncharacterized. However, it now appears that 
closer analysis could well permit functions to be 
predicted for many of these as well. A recently 
published study [80], using highly sophisticated 
computer programs for protein sequence com- 
parison, found that of 2300 known and proposed 
coli proteins (60% of the estimated total), more 
than 80% could be assigned at least a probable 
function. These authors describe 66% of coli proteins 
as having known functions and another 16% as 
sufficiently similar to already characterized proteins 
to permit an assignment of function. 

Pairwise comparison of the sequences of about 
1800 coli proteins reveals that about half have some 
sequence similarity over an extended region (100 or 
more amino acids) to one or more other coli proteins 
[81,82]. They can, on this basis, be assigned to 
groups which constitute small to large families. It is 
thought that each family is an evolutionary group- 
ing whose members have diverged from an ances- 
tral protein encoded by a single ancestral gene. A 
similar analysis of 2300 proteins is corroborative 
[83]. 
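
One simple way to turn a list of pairwise similarities into families is single-linkage grouping: any two proteins connected by a chain of significant similarities are placed in the same family. The sketch below does this with a union-find structure. It is an assumption that this matches the grouping rule used in the cited studies, which may apply stricter criteria, and the protein names in the note that follows are placeholders.

```python
def group_into_families(proteins, similar_pairs):
    """Group proteins into families by single linkage.

    proteins      : iterable of protein identifiers
    similar_pairs : iterable of (a, b) pairs judged to be similar
    Returns a list of sets, largest family first.
    """
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two families

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return sorted(families.values(), key=lambda f: (-len(f), min(f)))
```

With proteins A to E and similarities A-B and B-C, the result is one family of three (A, B, C) and two singletons, even though A and C were never compared directly.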


31.6.2 DNA sequence: base composition and 
codon usage 


Escherichia coli DNA has an average G + C content of 
51%, but there is considerable variation in base 
composition along the length of the DNA. The G+C 
content of overlapping 8-kb segments varies be- 
tween 47 and 56%; the variation in base composition 
of 1-kb segments is even more extreme, spanning 
30-65%. This variation is much greater than ex- 
pected by chance, but currently remains unexplain- 
ed. A similar degree of variation characterizes the 
contiguous sequence in yeast chromosome III (M. 
Masters, J. Collins & A. Coulson, unpublished data). 
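
Variation of this kind is measured by computing the G + C fraction in windows along the sequence. A minimal Python sketch, with illustrative function names, is:

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int):
    """G + C fraction of successive windows along the sequence.

    Returns (start, fraction) pairs; windows overlap when step < window.
    """
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

For the comparisons described above, `window` would be set to 8000 or 1000 bases; choosing `step` smaller than `window` gives the overlapping segments.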
Analyses have also been made of the frequency 
of occurrence of short oligonucleotide sequences 
[83,84]. Although several such sequences are over- 
or under-represented to some degree, the sequence 
CTAG is strikingly deficient, occurring at only one- 
thirtieth the expected frequency within genes. Its 
component subsequences, CTA and TA, are also 
under-represented. As might therefore be expected, 
the Arg codons AGA and AGG, the Leu codon CUA, 
and the UAG stop signal are relatively rarely used in 
E. coli [85]. It has been suggested that CTAG has been 
eliminated by selection because it could promote 
undesirable DNA bending. Consistent with this idea 
is the fact that CTAG deficiency does not extend to 
eukaryotic organisms, in which the C would be 
modified. Karlin and coworkers have analysed and 
discussed these features and other sequence inho- 
mogeneities [86,87]. The occurrence and possible role 
of CTAG has also been discussed in refs 79 and 88. 
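
The degree of under-representation of a short sequence can be expressed as the ratio of observed to expected occurrences. The sketch below uses the simplest possible expectation, based on base composition alone. This is an assumption made for illustration; the published analyses cited above may use more refined models that correct for the frequencies of the component di- and trinucleotides.

```python
def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    count = 0
    start = seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def observed_over_expected(seq: str, word: str) -> float:
    """Observed/expected ratio for word, from base composition only."""
    seq = seq.upper()
    word = word.upper()
    n = len(seq)
    if n < len(word):
        return float("nan")
    freq = {base: seq.count(base) / n for base in "ACGT"}
    # Expected count = number of positions x product of base frequencies.
    expected = float(n - len(word) + 1)
    for base in word:
        expected *= freq.get(base, 0.0)
    if expected == 0.0:
        return float("nan")
    return count_overlapping(seq, word) / expected
```

A ratio near one-thirtieth for `observed_over_expected(genome, "CTAG")` would correspond to the deficiency described above, although the exact value depends on the model used for the expectation.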

Sequences likely to promote static bending occur, 
at irregular intervals, about every 3 kb. Since they 
are more likely to be found in promoters than in 
coding sequences, a role for static bends in initiation 
of transcription can be inferred. 

Codon usage varies considerably between species 
[89] and reflects the availability of the corresponding 
tRNAs [90,91]. E. coli genes were early divided into 
two classes differing in codon usage [92,93]. Genes 
that are constitutively expressed at high levels show 
a strong codon usage bias, with a preference for 
codons recognized by the commoner tRNAs. 
Included in this class are genes that encode 
abundant DNA-binding proteins, and those whose 
products are involved in protein synthesis. Genes 
expressed at moderate or low levels belong to the 
second class, with less biased codon usage. Recent 
work has extended these observations to a larger 
group of genes and, in addition, uncovered a third 
class [85], with still less biased codon usage. Many 
proteins encoded by this third class of genes are 
located on the cell surface; others are encoded by 
insertion sequences or phage remnants. Three- 
quarters of plasmid genes analysed belong to this 
third class. It has therefore been suggested that 
genes with this distribution of codon usage did not 
evolve in E. coli, but have been acquired relatively 
recently through horizontal transmission. 


31.6.3 Arrangement of genes on the chromosome: 
some general principles 


31.6.3.1 Genes concerned with transcription and 
translation tend to be close to the origin of replication 

Because the E. coli chromosome replicates bidirectionally 
from a fixed origin to a terminus opposite 
[23,24], and because DNA replication requires the 
entire cell cycle for completion at moderate growth 
rates (and up to two cell cycles at high growth rates), 
the relative number of copies of any gene (its 
‘dosage’) depends on its chromosomal position vis- 
a-vis the replication origin. In cells growing expo- 
nentially in broth, genes located close to the origin 
have four times the dosage of those at the terminus 
[23]. It might therefore be expected that genes whose 
products would be in particular demand in fast- 
growing cells, could satisfy that demand to some 
degree by ‘choosing’ to be located near the origin. 
The augmented gene dosage of genes proximal to 
the origin of replication in fast-growing cells could 
at least partly relieve the need for elaborate growth- 
rate dependent transcriptional and translational 
controls [94]. 
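
The dependence of gene dosage on map position can be made quantitative with the widely used Cooper-Helmstetter description of the cell cycle, which is not named in this chapter and is introduced here only as an assumption. In that model the average number of copies per cell of a gene at fractional distance x from the origin (0 at the origin, 1 at the terminus) is 2^((C(1 - x) + D)/τ), where C is the time needed to replicate the chromosome, D the interval between termination and division, and τ the doubling time. The numerical values in the note that follows are illustrative.

```python
def gene_copies(position: float, c_min: float, d_min: float, tau_min: float) -> float:
    """Average copies per cell of a gene in an exponential culture.

    position : fractional distance from origin (0) to terminus (1)
    c_min    : chromosome replication time, in minutes
    d_min    : time between termination and cell division, in minutes
    tau_min  : doubling time, in minutes
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0 and 1")
    return 2.0 ** ((c_min * (1.0 - position) + d_min) / tau_min)


def origin_terminus_ratio(c_min: float, tau_min: float) -> float:
    """Dosage of origin-proximal genes relative to terminus genes."""
    return 2.0 ** (c_min / tau_min)
```

When replication occupies two doubling times (for example C = 40 min and τ = 20 min), the ratio is 4, in agreement with the fourfold difference quoted above for cells growing in broth. When replication occupies a single doubling time, the ratio falls to 2.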

The most obvious class of gene products required 
at relatively higher concentrations in rapidly 
growing cells are the components of the transcrip- 
tional and translational apparatus: RNA poly- 
merase subunits, the 50 or so ribosomal proteins, 
and 5S, 16S and 23S rRNAs. Ninety per cent of the 
genes for ribosomal subunits and all of those 
encoding RNA polymerase and its major σ (sigma)- 
factors are located in the third of the chromosome 
surrounding the origin, consistent with the gene 
dosage hypothesis presented above. Ribosomal 
RNAs cannot of course be translationally amplified; 
in order to obtain sufficient numbers of these mole- 
cules the genes encoding them are repeated. There 
are seven copies of the rRNA genes, five of them 
within 12 map units of the origin (see map, ref. 67). 

Conversely, it might be expected that genes 
encoding inessential products, or those that are 
never needed in high quantity, would be located 
near the terminus. Consistent with this idea is the 
fact that the terminus region, although coding [95], 
is particularly variable in sequence when compared 
with related strains and species [96], suggesting that 
it contains inessential genes. This has been con- 
firmed; almost all the DNA of the terminus region, 
with the exception of sequences likely to be con- 
cerned with the termination process itself, can be 
deleted with little ill-effect [97]. The terminus also 
appears to be a repository for a variety of elements of 
extrachromosomal origin (such as λ-related defective 
prophages [98-100]). Although the mechanism 
by which these elements were introduced to the 
terminus region is likely to involve enhanced, 
termination-related, recombination [101,102], their 
survival in the chromosome may be facilitated by 
the fact that their presence does not interfere with 
the gene dosage distribution of origin-proximal 
genes. 

31.6.3.2 Are genes transcriptionally orientated with 
direction of replication? 

In 1989, Brewer [103] noted that the transcriptional 
units specifying genes concerned with the syn- 
thesis of DNA, RNA and protein are orientated on 
the chromosome such that transcription from, and 
replication through, these genes occur in the same 
direction. She speculated that transcription 
proceeding in the opposing direction might 
interfere with replication, and predicted that genes 
with strong promoters (i.e. likely to be ‘on’ all the 
time) would be orientated so as to avoid potential 
conflict. Her analysis of 38 transcriptional units 
appeared to confirm this expectation. In an 
extended analysis of gene orientation [104], she 
found that of over 600 genes 70% appeared to be 
orientated in the same direction as replication; 
among 97 genes encoding RNAs and proteins 
active in translation, 92 were orientated with 
replication. 

Burland et al. [73] tested the original Brewer 
hypothesis further, using the sequence from 
81.5-86.5 min (which contains the origin). In this 
region two-thirds of the genes are orientated with 
replication, but no correlation between promoter 
strength and orientation could be demonstrated. 
Recent analysis of the much larger 1.6 Mb segment 
showed only a weak correlation between directions 
of transcription and replication overall, although 
there are some regions of strong correlation. 
However, genes likely to be highly expressible, as 
deduced from their biased codon usage, are found to 
be about five times more likely to be orientated with 
replication than against it. 
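
Whether a gene is orientated with or against replication depends on which replicating arm it lies in. The sketch below classifies genes from their map position and direction of transcription. The origin is placed at 84 min (within the 81.5-86.5 min interval mentioned above) and the terminus at 34 min, directly opposite; both values, the convention that clockwise means increasing map minutes, and the function names are assumptions made for illustration.

```python
ORIGIN_MIN = 84.0    # assumed position of the replication origin
TERMINUS_MIN = 34.0  # assumed position of the terminus, opposite the origin


def fork_direction(position_min: float,
                   origin: float = ORIGIN_MIN,
                   terminus: float = TERMINUS_MIN) -> int:
    """Return +1 if the fork passes this position clockwise, else -1."""
    clockwise_arm = (terminus - origin) % 100.0   # length of clockwise arm
    offset = (position_min - origin) % 100.0      # clockwise distance from origin
    return 1 if offset < clockwise_arm else -1


def is_co_oriented(position_min: float, strand: int,
                   origin: float = ORIGIN_MIN,
                   terminus: float = TERMINUS_MIN) -> bool:
    """strand is +1 for clockwise transcription, -1 for anticlockwise."""
    return strand == fork_direction(position_min, origin, terminus)


def fraction_co_oriented(genes) -> float:
    """genes: iterable of (position_min, strand) pairs."""
    genes = list(genes)
    if not genes:
        raise ValueError("no genes supplied")
    return sum(is_co_oriented(p, s) for p, s in genes) / len(genes)
```

Applying `fraction_co_oriented` separately to all genes and to the subset with strongly biased codon usage would give the kind of comparison described in the preceding paragraph.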


31.6.3.3 Rearrangements 

Is the order of genes on the chromosome important? 
This question arises because gene arrangement has, 
for the most part, been conserved between E. coli and 
S. typhimurium [105], organisms that have been 
separated, evolutionarily speaking, for millions of 
years. One approach to answering this question has 
been to determine whether inversions which radi- 
cally alter gene order are in fact deleterious [106]. 
Inversions can occur by recombination between 
small chromosomal duplications placed at the 
desired end points of the inversion. Regeneration of 
a selectable gene from its separated halves was used 
for selection and required inversion of the DNA 
located between duplicated elements positioned at 
pairs of preselected points on the chromosome [107]. 
The results were striking. Strains with inversions, 
even very large ones, that included the region from 
17 to 44 min, or that had occurred within a single 
replicating arm and did not include this region, 
grew normally, suggesting that, for the most part, 
gene order is not critical. 

There were two major classes of exceptions. The 
first were inversions which resulted in the relocation 
of the 86-91 min region, which encodes RNA 
polymerase subunits and contains three of the seven 
rRNA cistrons, to a point distant from the origin. 
Strains with these rearrangements were found to be 
rich-medium sensitive, lending support to the idea 
that high gene dosage of these loci is required at 
rapid growth rates. The second group of inversions, 
with one or both ends between 17 and 44 min, were 
either poorly tolerated or not obtainable at all. The 
reason for this is not understood. A plausible expla- 
nation, that inversions ending within the terminus 
region cause replication pausing at inverted Ter 
sites, has been excluded [108]. 

If inversions have little selective disadvantage, it 
might be anticipated that they would occur 
commonly. This does not appear to be the case. 
When K-12 strains are compared by restriction 
analysis using enzymes that cut at about 20 sites, 
they differ from one another by seven or eight 
insertions or deletions greater than 1 kb in length, 
but seldom exhibit inversions [11]. The genetic maps 
of E. coli and S. typhimurium are colinear, except for 
insertions and deletions, in all regions other than 
that near the replication terminus [109,110]. The two 
species are distinguished in this region by a single 
large inversion which spans the terminus. Thus, the 
order of genes on the chromosomes of enteric organ- 
isms appears to have been well conserved, sug- 
gesting that there is selective value in the existing 
arrangement. (Ironically, considering the rarity of 
inversions, the reference strain chosen by Kohara 
to construct the physical map harbours a large 
inversion relative to other K-12 strains [111]. The 
inverted DNA, generated by recombination be- 
tween rrnE, at 90.5min and rrnD at 72min, is 
depicted in its more usual orientation on the 
standard physical map.) 


31.6.4 Repeated sequences 


Short tandemly repeated sequences, such as those of 
eukaryotic microsatellite DNA, are not characteristic 
of E. coli DNA, but there are several known families 
of interspersed repeats. Some of these have defined 
functions; others, present in relatively few copies, 
appear to have been horizontally transferred and are 
unlikely to have significant functional importance. 
A final group is present in many copies and may 
well have functional roles, although the exact nature 
of these roles remains obscure. 


31.6.4.1 Interspersed repeats of known function 

These include the seven rrn operons, each of which 
encodes 5S, 16S, and 23S rRNA in addition to several 
tRNAs (these differ between loci). Some of the 
equivalent RNA sequences are identical, others 
differ slightly. The loci contain some unusual 
oligonucleotide sequences: for instance they are rich 
in the rare CTAG tetranucleotide, and contain the 
only sites that can be cut by the enzyme I-Ceu I [112], 
which is encoded by an intron of chloroplast origin 
and is specific for a 26-bp site in 23S rRNA. 

The other defined repeated sequence with a 
known functional role is the short, nonpalindromic 
sequence, 5′-GCTGGTGG-3′, termed Chi, which promotes 
homologous recombination in its vicinity (see 
ref. 113 for a review). Chi may occur either within or 
between genes. Chi-stimulated recombination is 
mediated by the major E. coli recombinational 
enzyme, the RecBCD complex, and Chi is thought to 
be a recognition site for the RecD subunit of this 
complex. An 8-bp sequence would be expected to 
occur by chance only about 70 times on the entire 
chromosome. Chi, in contrast, and consistent with 
its possessing a functional role, occurs over 20 times 
more frequently than this. Chi activity is directional, 
such that two Chi sequences, in opposing orienta- 
tion, should be required to stimulate replacement of 
the DNA between them with homologous sequence. 
It is thus remarkable to find that over 90% of Chi 
sites are orientated relative to replication (discussed 
in ref. 73); Chi appears on one strand only, the 
leading strand of newly replicated DNA. This 
suggests that the primary role of Chi may be related 
to events at the replicating fork rather than to the 
stimulation of recombination between nonreplicat- 
ing fragments; further information is required. 
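
The chance expectation quoted above can be reproduced with a short calculation. The sketch below assumes equal base frequencies and counts one specific octamer on one strand only; the genome length is the figure given for entry U00096 later in this chapter, and the function names are illustrative rather than taken from any published tool. The second function counts a motif separately on each strand, which is the kind of tally needed to examine the orientation bias of Chi.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def expected_occurrences(genome_length: int, k: int) -> float:
    """Expected count of one specific k-mer on one strand,
    assuming all four bases are equally frequent."""
    return genome_length / 4 ** k


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sites(seq: str, motif: str) -> tuple[int, int]:
    """Count occurrences of motif on the given strand and on the
    complementary strand (overlapping matches are counted)."""
    def count(text: str, pattern: str) -> int:
        total = 0
        index = text.find(pattern)
        while index != -1:
            total += 1
            index = text.find(pattern, index + 1)
        return total

    return count(seq, motif), count(seq, reverse_complement(motif))


GENOME_LENGTH = 4_638_858  # bp, as quoted for GenBank entry U00096
CHI = "GCTGGTGG"

print(round(expected_occurrences(GENOME_LENGTH, len(CHI))))  # 71
```

Applied to a complete genomic sequence, `count_sites` returns the two per-strand counts; relating these to the direction of replication additionally requires the positions of the origin and terminus, which are not modelled here.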


31.6.4.2 Sequences likely to have been acquired as a result of horizontal transmission 

These include elements ranging from less than 1 kb 
long, to phage-sized sequences: all contain at least 
one ORF. 


Insertion sequences Insertion sequences (IS elements) 
are transposable elements roughly 0.8-1.5kb long; 
they are all similarly organized, with repeated 
sequences flanking a transposase gene (see ref. 114 
for a collection of reviews). The E. coli chromosome 
contains about 10 different types of these elements, 
each present in different copy numbers. They trans- 
pose or are lost with frequencies well above that at 
which point mutations occur, resulting in extensive 
variation between strains as to numbers and posi- 
tions of particular elements. Because IS elements 
transpose to positions within, as well as between, 
genes, they are often the agents of spontaneous 
mutation. Birkenbihl and Vielmetter [115] located 
the positions of five types of IS element in three K-12 
strains. Copy number varied from 1 to 23; the three 
strains harboured, respectively, a total of 42, 46 and 
57 IS sequences of the types studied. IS sequences 
are not genetically silent; they not only inactivate 
genes into which they are inserted, but can also 
activate genes either by the introduction of 
promoters, or by the alteration of DNA structure in 
their vicinity. 

Transposons are related to IS elements and are 
probably derived from them. They are distinguished 
by the presence of genes for selectable characters 
which allows their genetic manipulation. Trans- 
posons are absent from wild-type K-12 cells but are 
transmitted in natural populations by plasmid 
vectors. 


Longer elements Rhs elements (reviewed in ref. 116) 
are a family of long (7 kb is typical) complex sequen- 
ces, with a tantalizing organizational similarity to 
the mammalian transposable sequence LINE-1. 
They were initially recognized as chromosomal 
hotspots for the initiation of duplication events. Rhs 
sequences are found in the chromosomes of many, 
but not all, naturally occurring E. coli strains. E. coli 
K-12 has five and part of a sixth; in total, they 
account for 0.8% of the chromosome. Each consists 
of a GC-rich core followed by an AT-rich extension, 
together defining an ORF which could encode a very 
large protein (up to 160 000 relative molecular mass) 
with some similarities to secreted or cell-surface 
proteins. It has not been possible to obtain evidence 
that the encoded proteins are expressed or that the 
elements possess active promoters. Because the base composition of 
Rhs elements is dissimilar to that of E. coli, and 
because they are present in only some strains, they 
are presumed to be of heterospecific origin. Con- 
sistent with this idea is the fact that some Rhs 
elements appear to contain a region with the charac- 
teristics of an IS sequence. 


Lambdoid phages Escherichia coli strains can contain 
several defective prophages [98, 99, 117] with some 
sequence similarity to λ, together accounting for 
several per cent of the genome. 


31.6.4.3 Short multicopy palindromic repeats 

Two classes of repeat of this type have been well 
documented: repetitive extragenic palindromic 
(REP) sequences [118], also known as palindromic units (PU) 
[119], and enterobacterial repetitive intergenic 
consensus (ERIC) sequences [120], also known as intergenic repeat 
unit (IRU) sequences [121]. As can be deduced from 
their names, these sequences occur between, rather 
than within, genes. They are located at the 3′-ends of 
transcripts in noncoding sequence; thus, they are 
always transcribed but never translated. Because 
they are palindromic, they can be expected to give 
rise to stem-loop structures in RNA and, possibly, 
also in DNA. These sequences, especially REP, are 
common, and, although short, are so numerous that 
they have been estimated to compose up to 1% of the 
genome. It is hard to imagine that sequences as 
numerous as these, and located only between genes, 
do not have a function; but, although several 
possible functions have been suggested, the role(s) 
of these repeated sequences remain elusive. A 
review describing REP and ERIC sequences 
appeared in 1992 and should be consulted for 
references not listed here [122]. 


REP sequences REP consists of a 30- to 40-bp 
consensus sequence, with dyad symmetry, which 
can form a stable stem-loop structure. REP sequen- 
ces may occur singly, in pairs, or less often, in clus- 
ters containing several copies. Their locations have 
been identified both by hybridization (using a consensus 
probe) and by analysis of existing sequence. In 
the 1.6Mb continuous sequence available in 1996, 
REP elements occur, on average, once per 13kb. 
Because the REP palindrome is not perfect, left and 
right ends can be distinguished. Within a REP 
cluster, individual units alternate in orientation and 
are separated by one of a group of other conserved 
sequence motifs. The clusters can be divided into 
groups on the basis of the other sequences which 
they contain. Complex arrangements of this sort 
have been termed bacterial interspersed mosaic 
elements (BIMEs). The submotif structure of BIMEs 
has been analysed [46]. 
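
The notion of an imperfect palindrome with dyad symmetry can be made concrete by comparing a sequence with its own reverse complement. The following is a minimal sketch; the scoring scheme (the fraction of positions that agree) is an assumption chosen for simplicity, and the test sequences are arbitrary illustrations, not the REP consensus.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def dyad_symmetry(seq: str) -> float:
    """Fraction of positions at which a sequence matches its own
    reverse complement. A perfect palindrome scores 1.0; any
    unpaired loop bases in the middle count as mismatches."""
    seq = seq.upper()
    partner = reverse_complement(seq)
    matches = sum(a == b for a, b in zip(seq, partner))
    return matches / len(seq)


print(dyad_symmetry("GAATTC"))  # 1.0, a perfect palindrome
print(dyad_symmetry("GAATTA") < 1.0)  # True, an imperfect one
```

Because an imperfect palindrome differs from its reverse complement, the two orientations of such an element can be told apart, which is what allows left and right ends to be assigned.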

Although the primary function of REP/PU ele- 
ments is not known, at the RNA level REP sequences 
located between genes in an operon have been 
shown to be able to stabilize the upstream message, 
probably because they adopt a stem-loop confi- 
guration which limits degradation by exonucleases. 
There is also evidence that certain REP sequences, 
located between convergently transcribed genes, 
can act as transcription terminators. At the DNA 
level, REP sequences have been shown to bind DNA 
gyrase and DNA polymerase I, leading to the 
proposal that they may have a role in maintaining 
the domain structure of the nucleoid. A subclass of 
BIMEs can bind IHF, a small DNA-binding protein 
which bends the DNA; this could facilitate gyrase 
action, or help to stabilize stem-loops [123]. 

REP sequences are not confined to E. coli. 
Hybridization analysis suggests that they are 
numerous in closely related species; surprisingly, 
they can even be found in quite distantly related 
phyla. PCR using REP-based primers has been 
used to ‘fingerprint’ bacterial strains of diverse 
eubacterial species; this method may be clinically 
useful for strain identification [124,125]. When the 
chromosomal locations of REP sequences in E. coli 
and the related species, S. typhimurium, are com- 
pared, it is evident that REP has been conserved in 
certain positions but not in others. Clearly REP is 
not an essential element for the control of expression 
of particular genes; perhaps the suggested role in 
chromosome structure maintenance is the more 
plausible. 


Other repeated short palindromes The ERIC/IRU 
sequence is an imperfect palindrome 126 bp long. 
Like REP, it is located in transcribed, but nontrans- 
lated, portions of the genome. It too is widespread, 
at least among the enterobacteria. However, there 
are far fewer individual occurrences per genome; 
the initial 1.6 Mb of E. coli sequence includes only four 
IRUs. It appears to be commoner in S. typhimurium 
than in E. coli; several locations have been identified 
in the former where IRU does not occur in the latter. 
An attempt to find other short repeated sequences 
using a computational approach [126] identified 
additional groups of short palindromes; one of these 
is the sequence characteristic of rho-independent 
terminators. Two other short elements were defined, 
one within, the second between, genes but have not 
been further characterized. 


31.7 Escherichia coli genome sequencing projects 


Until a few years ago, all E. coli sequences had been 
contributed, piecemeal, to the databases by labora- 
tories interested in particular genes and operons. 
Although each of these contributions was small (in 
this era of megabase sequencing efforts), by 1989 
they had together provided about 20% of the total 
genomic sequence. At that time, several large-scale 
sequencing efforts were initiated. These had the goal 
of generating long contiguous sequences which 
would, ultimately, be merged to yield the complete 
sequence of the E. coli chromosome (derived, in at 
least one case, from a single K-12 strain). Publication 
of data from these large-scale efforts began in 1992. 
By December 1995, sequence data for ~75% of the 
genome, originating from both large-scale and more 
limited efforts, was available in databases. The 
complete sequence became available in January 
1997. An analysis of the sequence with annotations 
has been published [182]. 


31.7.1 Sequencing projects 


31.7.1.1 The E. coli Genome Project at the University of Wisconsin, Madison 

This group, led by F. Blattner, with US government 
funding, adopted a ‘start-from-scratch’ approach. A 
wild-type strain of E. coli K-12, MG1655, thought to be 
relatively undiverged from original wild-type 
isolates, was selected in consultation with B. 
Bachmann of the E. coli Genetic Stock Center at Yale 
University and used to make a new overlapping λ 
clone library [127]. Random sequencing, λ clone by λ 
clone, was then to be employed to sequence the entire 
chromosome, including previously sequenced seg- 
ments. (There will be a more detailed description of 
the methodology used for sequencing and analysis/ 
annotation in a following section.) The Wisconsin 
group chose to sequence the region counterclockwise 
from 100/0 minutes. Their publications to 1995 
reported and analysed the following sequences: 

ECOUW76 (U00039): 76.0-81.5 min [76] 

ECOUW82 (L10328): 81.5-84.5 min [73] 

ECOUW85 (M87049): 84.5-87.2 min [72] 

ECOUW87 (L19201): 87.2-89.2 min [74] 

ECOUW89 (U00006): 89.2-92.8 min [75] 

ECOUW93 (U14003): 92.8-100 min [77] 

ECOUW67 (U18997): 67.4-76.0 min 

In the title of the first paper from this group [72] 
the map coordinates, taken from the version of the 
genetic map then in use, were indicated as 84.5-86.5 
min. This sequence spans 84.5-87.2 min on the 
current map [67], and is so indicated above. There 
are no gaps between the sequences listed. The 
Wisconsin group, after overcoming some funding 
problems (see ref. 133 for a discussion of why the 
original estimates of the time required to sequence 
the E. coli genome proved overly optimistic) and 
altering certain techniques (see below) have now 
released the full sequence; ECOLI (U00096): 
Escherichia coli K-12, complete genome (4 638 858 bp) 
was deposited January 16, 1997 and released 
January 25, 1997 as a full annotated sequence. It is 
available via the Entrez Genomes division, 
GenBank, and the BLAST databases. In the Entrez 
Genomes division, the entire E. coli genome can be 
examined and explored at once. In GenBank itself, 
the 4.6Mb E. coli sequence has been split into 400 
records of approximately 11500 bp each. These sub- 
sequences are also the entries used in the BLAST 
non-redundant databases for both peptide and 
nucleotide sequences. 
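
The statement above that the listed Wisconsin segments abut without gaps can be checked mechanically from their map coordinates. The sketch below is illustrative only: the coordinates are those listed above, the helper names are hypothetical, and the conversion from map minutes to base pairs assumes a uniform scale over the 100-minute map, which is only an approximation.

```python
GENOME_BP = 4_638_858  # bp, as quoted above for entry U00096
MAP_MINUTES = 100.0    # length of the circular genetic map

# (start, end) in map minutes for the segments listed above
SEGMENTS = [
    (76.0, 81.5),
    (81.5, 84.5),
    (84.5, 87.2),
    (87.2, 89.2),
    (89.2, 92.8),
    (92.8, 100.0),
    (67.4, 76.0),
]


def find_breaks(segments):
    """Return (end, next_start) pairs wherever consecutive segments,
    sorted by start, do not abut exactly (a gap or an overlap)."""
    ordered = sorted(segments)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start != prev_end
    ]


def minutes_to_bp(minutes: float) -> int:
    """Approximate physical length of a map interval, assuming a
    uniform minutes-to-bp scale (an assumption, not a measured value)."""
    return round(minutes / MAP_MINUTES * GENOME_BP)


print(find_breaks(SEGMENTS))  # [] when every segment abuts the next
```

On the same uniform-scale assumption, the 32.6 minutes spanned by these entries (67.4 to 100 min) correspond to roughly 1.5 Mb.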

One can also search E. coli separately in databases 
derived from the Wisconsin entry. In the Peptide 
Sequence Databases, ‘E. coli’ contains the E. coli 
genomic CDS translations, while in the Nucleotide 
Sequence Databases ‘E. coli’ contains the E. coli 
genomic nucleotide sequences. These databases can 
be searched using a BLAST client such as NCBI’s 
web-based one at http://www.ncbi.nlm.nih.gov/BLAST/. 
See the Wisconsin WWW page, http://www.genetics.wisc.edu/.html, 
for instructions on how to download the sequence. The Wisconsin 
group can be reached by e-mail at ecoli@genetics.wisc.edu 


31.7.1.2 The Japanese projects 

An initial effort involved a consortium of 
laboratories in Japan. Using the λ clones from the 
Kohara set (prepared from strain W3110), they 
sequenced DNA to fill the gaps in already available 
sequence in order to merge the data into contiguous 
sequences. Working clockwise from 100/0 minutes, 
they have released ~285 kbp of this composite 
sequence so far: 

ECO110K (D10483): 0-2.4 min [78] 

ECO82K (D26562): 2.4-4.1 min [79] 

ECOTSF (D83536): 4.0-6.0 min [unpublished] 

Funding lapsed for the project as originally 
constituted, but sequencing was resumed following 
the assembly of the Japanese Escherichia coli genome 
project team (~36 scientists) coordinated by 
Professor T. Horiuchi (National Institute for Basic 
Biology in Okazaki, 444 Japan). This group has 
continued to sequence Kohara clones, proceeding 
from ~13 min in a clockwise direction (with a few 
gaps filled with other database entries) and have 
completed over half the genome. Their sequence is 
available in a series of database entries (D90699- 
D90892 with some numbers in the sequence not 
used) each corresponding to a Kohara phage. Their 
results have been reported in seven papers so far 
(others are in preparation): 12.7-28.0 min [131,132]; 
28.0-40.1 min [128,173]; 40.1-50.0 min [174, 175]; 
50.0-68.8 min [176]. 

They also merged their sequence with the earlier 
Wisconsin data to generate a complete sequence of 
the E. coli genome. Their data is available on World 
Wide Web servers at http://bsw3.aist-nara.ac.jp/ or 
http://mol.genes.nig.ac.jp/ecoli/. Information about 
numbered Kohara clones can be viewed at 
http://www.ddbj.nig.ac.jp/e-coli/ecoli_list.html 


31.7.1.3 Harvard University 

A group at Harvard University, headed by G. 
Church, has reported two large contiguous sequen- 
ces which were determined as part of a program to 
establish new methods in automatic DNA sequence 
determination. They utilized a multiplex sequenc- 
ing approach (ref. 129 and see Chapter 20) to 
sequence cosmid clones derived from the E. coli K-12 
strain BHB2600, primarily as a test-bed for further 
technology development. Although their sequences, 
ECOHU47 (U00007) and ECOHU49 (U00008) from 
the 47-49 min region, are included in the databases, 
no paper has yet been published (P. Richterich, N. 
Lakey, G. Gryan et al., unpublished data). 


31.7.1.4 Stanford group 

A group at Stanford, headed by R.W. Davis, prepared 
a sequencing library from pulsed field gel purified 
genomic AvrII fragments (minutes 4-25) of E. coli 
strain MG1655 (the same strain as used by the 
Blattner group). Four overlapping entries have 
been deposited in GenBank so far, together totalling 
526 200 bp in length (September, 1996-January, 1997): 

ECU70214 (U70214) Escherichia coli chromosome 
minutes 4-6. 

ECU73857 (U73857) Escherichia coli chromosome 
minutes 6-8. 

ECU82664 (U82664) Escherichia coli minutes 9-11 
genomic sequence. 

ECU82598 (U82598) Escherichia coli genomic 
sequence of minutes 11-12. 


31.7.2 Analysis strategy employed by the Wisconsin group 


As mentioned above, this group is completely 
sequencing the chromosome of a single strain of E. 
coli K-12, rather than targeting unsequenced inter- 
vals in a ‘composite genome’ constructed from 
various database entries. Since they have contri- 
buted more coli sequence than any other group, the 
strategy they have used will be described here in 
some detail. 

Blattner’s group divided the sequencing process 
into 10 steps, from strategy planning for initial DNA 
isolation to deposition of the final sequence in 
GenBank. These steps (see Table 31.2), or status 
levels, are used to define the degree of completion for 
a sequence segment. While this procedure is specific to 
Blattner’s group, it provides a useful overview of the 
processes common to many genomic sequencing 
projects, and for that reason is described here. 


31.7.2.1 Preparation of random clones for sequencing (‘shotguns’) 

The first step in the process is the selection of a 
genomic region for sequencing, and choice of the 
source(s) of DNA for that sequencing. As noted in 
Section 31.9, several smaller genomes have sub- 
sequently been tackled as single units. However, 
when the E. coli project was initiated, the tools 
(especially sequence assembly software) necessary 
to succeed on such a scale were unavailable. For that 
reason, a set of clones from the MG1655 clone bank, 
chosen so as to minimize overlap and hence 
redundant effort, was selected for sequencing. 

To prepare single-stranded sequencing templates, 
DNA from selected λ-clones is physically sheared 
and fragments of 0.7-2.0 kbp in length are cloned 
into the SmaI site of the M13 Janus vector [134]. 
Although physical shearing of DNA yields 
fragments which require end-repair treatment (by 
mung bean nuclease, or a combination of T4 and 
Klenow DNA polymerases) before cloning, it was 
adopted because enzymatic treatments (including 
DNase) do not yield random-ended fragments of 
appropriate size. M13 shotgun clones containing 
inserts derived from the λ-vector arms, or containing 
DNA which had already been sequenced 
when analysing overlapping λ-clones, are identified 
by hybridization and discarded. The resulting 
library, from which unwanted clones have been 
removed, is archived and used for phage growth 
and DNA preparation [135]. 


31.7.2.2 Random data collection and assembly 

The Wisconsin group collected its first 1.5 Mb of data 
using Sequenase and ³⁵S-label in Sanger dideoxy 
sequencing reactions (see Chapter 22). The reactions 
were performed by a sequencing reaction robot 
developed by the group, and resolved on large- 
format gels. Autoradiographs were digitized, or 
scanned photoelectrically with an experimental film 
scanner and base-calling software, to generate 
sequence files for assembly into contigs. Assembly 
software prescreened the individual sequences, 
removing any λ-sequence and trimming M13 vector 
sequences from the 5′ and/or 3′ ends. 

After sufficient random data for five- to sevenfold 
coverage was collected, an initial assembly was 
attempted. Typically a number of 1-5 kbp contigs 
will have been generated for each 15-20 kbp λ-clone 
insert. The next step involves the systematic use of 
reverse-strand sequencing to extend the contigs and 
achieve better coverage of regions for which this 
is necessary. The Janus vector was developed to 
simplify reverse-strand sequencing [8]. Janus is an 
M13 derivative engineered so that a cloned insert 
may be sequenced from the opposite end, on the 
opposite strand, by inversion of the insert in vivo 
(flipping). The inversion is achieved by growing the 
phage on a host supplying phage λ Int recombinase, 
which acts at the att sites which flank the insert in the 
Janus vector. This permits all sequencing to be done 
from single-stranded template using standard 
primers and avoids the effort and expense of dealing 
with plasmids, PCR, or in vitro recloning to obtain 
reverse strand information. 
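
The relationship between the depth of random coverage and the completeness of an assembly can be estimated with the standard Lander-Waterman (Poisson) model. This model is introduced here for illustration and is not described in the text as part of the Wisconsin procedure; the read length of 300 bases is likewise an assumed value, since none is given for the radioactive phase.

```python
import math


def reads_needed(target_bp: int, read_bp: int, coverage: float) -> int:
    """Number of random reads required to reach a given mean coverage."""
    return math.ceil(coverage * target_bp / read_bp)


def uncovered_fraction(coverage: float) -> float:
    """Poisson estimate of the fraction of bases sampled by no read."""
    return math.exp(-coverage)


def expected_contigs(n_reads: int, coverage: float) -> float:
    """Lander-Waterman estimate of the number of contigs, ignoring the
    minimum overlap needed before two reads can be joined."""
    return n_reads * math.exp(-coverage)


# Hypothetical example: a 20 kbp insert, 300-base reads, sixfold coverage
n = reads_needed(20_000, 300, 6)
print(n)  # 400
print(uncovered_fraction(6))
print(expected_contigs(n, 6))
```

Under these idealized assumptions sixfold coverage leaves well under 1% of bases unsampled. Real shotgun libraries are not perfectly random, so an initial assembly in several contigs, as described above, is not inconsistent with the estimate.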


Table 31.2 The 10-step sequencing process used in the Wisconsin E. coli Genome Project. 

Step  Definition of the step, and summary of the processes involved 

1     Project strategy chosen 
      Select genomic region and sources of DNA for sequencing 
      Plan handling of overlaps, gaps, and abutments between subsections 
      Define strategy for assembly and coverage reduction 

2     Shotgun made 
      Prepare source DNA, and verify restriction map 
      Subclone DNAs into M13 Janus vector 
      Select M13 clones for sequencing with plaque hybridization 
      Grow M13 clones and prepare template DNA 

3     Data gathering 
      Prepare sequencing reactions 
      Load and run gels 
      Collect sequence data 

4     Initial assembly 
      Remove vector sequences and attempt assembly 
      Identify problems and remove bad data, if any 
      Call for Janus flips using automated coverage analyser 
      Use software to call for compression resolutions 
      Repeat process with new data, as necessary 

5     Assembled 
      Automatically generate sequence with only a few gaps or trouble spots 
      Update informatics system and transfer files to finishing team 

6     Edited 
      Edit alignments on screen 
      Proofread primary sequence data (traces) in ambiguous regions 
      Call for more data in thin or ambiguous areas 
      Design primers for primer walking experiments where needed 

7     Provisional 
      Re-edit areas after any additional data has been added 
      Generate provisional consensus sequence containing minimal ambiguities 

8     Bio-checked 
      Identify ORFs with software and examine codon-usage statistics 
      Scan nucleic acid and protein databases for similarities 
      Re-examine data in areas of disagreement, especially possible frameshift errors 

9     Annotated 
      Splice segments to create extended annotation unit, if appropriate 
      Annotate a set of qualified ORFs 
      Identify ORFs with gene or function where possible 

10    Finished 
      Splice to update single contig 
      Deposit annotated sequence in GenBank 


Choice of clones to flip was done automatically by 
a computer program, ‘Spanner’. Working from an 
unedited initial assembly, the program identified 
clones with inserts bordering the ends of short 
contigs that could be flipped to achieve closure, 
inserts that could be flipped to provide second- 
strand sequence for regions sequenced on only one 
strand, and DNA originating from near the insert-λ 
junction which, if sequenced from the other end, 
could extend the total length of sequence determined. 
The program could also select a minimal set 
of M13 clones that fully covers the target segment; 
members of this set can be resequenced with dITP in 
order to resolve sequence compressions. 

Addition of data from flipped clones generally 
serves to merge the smaller contigs into a single unit 
with sufficient coverage to constitute a sequence 
ready for alignment editing. 
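
Selecting a minimal set of clones that fully covers a target segment, as mentioned above, is an instance of the classical interval-covering problem, for which a greedy strategy suffices. The sketch below illustrates that general technique; it is not a reconstruction of the Spanner program, whose internal algorithm is not described here, and the clone names and coordinates are hypothetical.

```python
def minimal_cover(clones, start, end):
    """Greedy choice of the fewest intervals that together cover
    [start, end). `clones` maps a clone name to (begin, stop),
    half-open. Raises ValueError if some position cannot be covered."""
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    chosen = []
    position = start
    index = 0
    while position < end:
        best_name, best_stop = None, position
        # Among clones starting at or before the current position,
        # keep the one that reaches furthest to the right.
        while index < len(ordered) and ordered[index][1][0] <= position:
            name, (_, stop) = ordered[index]
            if stop > best_stop:
                best_name, best_stop = name, stop
            index += 1
        if best_name is None:
            raise ValueError(f"no clone covers position {position}")
        chosen.append(best_name)
        position = best_stop
    return chosen


# Hypothetical M13 inserts placed on a 2000 bp target
example = {
    "a": (0, 500),
    "b": (100, 900),
    "c": (400, 1200),
    "d": (850, 1500),
    "e": (1100, 2000),
}
print(minimal_cover(example, 0, 2000))  # ['a', 'c', 'e']
```

Taking, at each step, the clone that extends furthest beyond the region already covered yields a cover with the fewest possible members, so the number of templates to be resequenced is kept to a minimum.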


31.7.2.3 Editing and finishing 

Contigs were then edited on screen: alignment 
errors were corrected, misaligned repeats realigned, 
and conflicts and ambiguities resolved by again 
consulting the sequence autoradiographs or 
fluorescent traces. Additional sequence data was 
obtained if necessary for resolution of ambiguities; 
this may involve a directed approach in which 
primers are designed to allow ‘walking’. A 
minimum standard of completion required at least 
two determinations of every base, with at least one 
determination from each strand. The end-result of 
the editing process was a ‘provisional’ consensus 
sequence; discovery of the few remaining errors 
usually occurred in the course of analysing the 
sequence for its information content. 
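
The minimum standard of completion described above lends itself to an automated check. In the sketch below, reads are represented as hypothetical (start, end, strand) tuples on the consensus; this representation is an assumption made for illustration and does not correspond to the file formats actually used by the group.

```python
def unfinished_positions(reads, length):
    """Return consensus positions that fail the finishing standard.

    `reads` is an iterable of (start, end, strand) with half-open
    coordinates and strand '+' or '-'. A position passes when it is
    determined at least once on each strand, which also guarantees
    the required minimum of two determinations."""
    plus = [0] * length
    minus = [0] * length
    for start, end, strand in reads:
        counts = plus if strand == "+" else minus
        for pos in range(max(start, 0), min(end, length)):
            counts[pos] += 1
    return [pos for pos in range(length)
            if plus[pos] == 0 or minus[pos] == 0]


# Hypothetical reads on a 10-base consensus
reads = [(0, 6, "+"), (4, 10, "-"), (0, 5, "-")]
print(unfinished_positions(reads, 10))  # [6, 7, 8, 9]
```

Positions reported by such a check are those for which further data would be called, for example by flipping a suitable clone or by primer walking.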


31.7.2.4 Identification of potential genes in provisional sequence 

Sequencing accuracy was crucial: since an average 
E. coli gene is encoded by 1200 bp, even one base per 
kilobase, inserted or deleted as a result of sequen- 
cing error, could prevent correct identification of 
most genes. Once the sequence was as accurate as 
could be achieved, and alignment 
had been verified by the methods described above, 
the sequence was analysed for ORFs. Remaining 
errors that give rise to frameshifts could often be 
detected during this process. 
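
The claim that one insertion or deletion per kilobase would disrupt most genes can be checked with a simple estimate. The sketch assumes that errors occur independently and at a uniform rate along the sequence (a Poisson approximation), which is a simplification; the function name is illustrative.

```python
import math


def fraction_of_genes_hit(gene_bp: float, indel_rate_per_bp: float) -> float:
    """Probability that a gene of the given length contains at least
    one insertion or deletion error, assuming independent errors."""
    return 1.0 - math.exp(-gene_bp * indel_rate_per_bp)


# Average gene of 1200 bp, one indel error per kilobase
print(fraction_of_genes_hit(1200, 1 / 1000))
# The same gene at one indel error per 10 kb
print(fraction_of_genes_hit(1200, 1 / 10_000))
```

On these assumptions about 70% of average-sized genes would contain a frameshift at an error rate of one per kilobase, falling to a little over 10% at one per 10 kb.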

Sequences were examined for potential genes 
using ORF searches and pattern of codon usage sta- 
tistics. Geneplot, a commercially available (DNASTAR) 
codon usage based, gene-finding program with a 
graphical output, uses the methods of Staden [136], 
Gribskov [137], and Borodovsky [138]. It finds all 
potential ORFs in each of the six reading frames, and 
analyses their codon usages in comparison to a 
reference set (i.e. does the potential ORF contain the 
distribution of codons found in known E. coli genes?). 
This allows simple detection of possible frameshift 
errors, which appear as overlapping genes in 
different frames, each with codon usage scores 
falling to zero at the error point. The sequence for 
such regions was again rechecked and, if need be, re- 
sequenced. 
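
The first stage of such an analysis, the enumeration of ORFs in all six reading frames, is straightforward to express in code. The sketch below is a generic illustration and does not reproduce Geneplot or its codon-usage scoring; it recognizes only ATG as a start codon, ignores ORFs that run off the end of the sequence, and uses a hypothetical minimum-length threshold.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_on_strand(seq: str, min_codons: int):
    """Find ATG...stop ORFs in the three frames of one strand.
    Returns (frame, start, end) with `end` just past the stop codon."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i + 3))
                start = None
    return found


def six_frame_orfs(seq: str, min_codons: int = 100):
    """Return (strand, frame, start, end) for ORFs on both strands,
    with coordinates expressed on the forward strand."""
    seq = seq.upper()
    length = len(seq)
    results = [("+", f, s, e) for f, s, e in orfs_on_strand(seq, min_codons)]
    for f, s, e in orfs_on_strand(reverse_complement(seq), min_codons):
        results.append(("-", f, length - e, length - s))
    return results


# Tiny hypothetical sequence: ATG, three GCT codons, then TAA
demo = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(six_frame_orfs(demo, min_codons=3))  # [('+', 2, 2, 17)]
```

A frameshift error of the kind described above would show up in such output as two shorter ORFs in different frames of the same strand, meeting near the position of the error.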

Of the coding region detection programs, 
Borodovsky’s Genemark [12], which identifies coding 
regions simultaneously on both strands, permitting 
overlapping and complementary reading frames to 
be assessed and compared, is the most effective, and 
the program has been used in several prokaryotic 
genome projects. Although the Wisconsin group 
uses an in-house implementation, the original Genemark 
is available, via an e-mail server, to automatically 
analyse correctly formatted sequences 
included within an e-mail message. For information, 
send e-mail containing the message ‘help’ to 
genemark@ford.gatech.edu or genemark@embl-ebi.ac.uk. 
Instructions for using Genemark can 
be found at http://amber.biology.gatech.edu/william/genemark.html. 


31.7.2.5 Annotation and analysis 

Comparisons were made with any previously 
reported sequence from the same region and 
differences carefully evaluated; consultation with 
other scientists regarding nontrivial sequence 
differences was initiated. In some cases, errors have 
been detected in previous database entries. These 
take many forms, perhaps the most common being 
frameshifts found in the older entries for which no 
method of resolving compressed GC-rich sequences 
had been used. Other common errors include 
restriction fragments accidentally lost or included 
at subcloning, errors in assembling pieces of a 
sequence, and inclusion of unidentified vector 
sequences. Finally some differences cannot be 
explained (without further experimentation) except 
as strain variations. Thus great caution should be 
used in the interpretation of database matches and 
differences. While automated search routines have 
made such comparisons easy to do, a good deal of 
human judgement is still required to make sense of 
the results. 

For previously unidentified potential genes, an 
attempt to identify possible homologues was made 
by comparing the new sequence to both the nucleo- 
tide and protein sequence databases. The ORFs were 
also checked for the occurrence of functional motifs, 
using the program MacPattern [139] to search the 
ProSite [140] and BLOCKS [141] databases. The 
results of these sequence searches and examination 
of the relevant literature have led to new gene 
discoveries. However, even in an organism as well 
studied as E. coli, about half of all newly sequenced 
ORFs have no certain homologues in the databases 
and remain as hypothetical genes. 

Other sequence features, including insertion ele- 
ments, repetitive sequences, and prophages, can also 
be identified by database comparisons. Potential 
promoters were located in the sequence by matrix 
search programs, but it is clear that what constitutes 
a good promoter is not completely represented by a 
matrix score, since well characterized promoters do 
not always correspond to those found by the 
computer. The sequences were also searched using 
pattern-matching software to locate possible 
transcription terminators, potential static bends, Chi 
sites, and other features of global significance. 
Particular attention was paid to stretches of more 
than a few hundred bases in which no features are 
readily discovered. The term ‘grey hole’ has been 
coined for such regions [72]; they are especially 
intriguing in the otherwise densely packed genomes 
of prokaryotes (the final sequence contains ~7 of 
these). 
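
Once features have been annotated, ‘grey holes’ can be located simply by looking for long stretches that no feature covers. The sketch below is a minimal illustration; the threshold of 300 bases is an assumed reading of ‘more than a few hundred bases’, the feature coordinates are hypothetical, and the sequence is treated as a linear segment.

```python
def grey_holes(features, length, min_gap=300):
    """Return (start, end) stretches of at least `min_gap` bases that
    are not covered by any annotated feature.

    `features` is an iterable of half-open (start, end) intervals on
    a linear segment of the given length."""
    holes = []
    position = 0
    for start, end in sorted(features):
        if start - position >= min_gap:
            holes.append((position, start))
        position = max(position, end)
    if length - position >= min_gap:
        holes.append((position, length))
    return holes


# Hypothetical annotation of a 3600 bp segment
annotated = [(0, 1000), (1100, 2000), (2500, 3000)]
print(grey_holes(annotated, 3600))  # [(2000, 2500), (3000, 3600)]
```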

The gene and feature identifications were 
correlated to generate a ‘final’ annotated sequence 
which was submitted to GenBank. In addition, using 
software developed to splice together such anno- 
tated sequences, each completed segment was added 
to the accumulating genomic contig. 


31.7.2.6 Improvements to the process 

In order to complete the E. coli genome sequence 
more rapidly the Wisconsin group made two major 
changes in methodology when sequencing was 
about half completed. 

First, after a transitional period in which both 
were used, the group completely substituted 
fluorescent for radioactive sequencing. Fortunately, 
the ssDNA templates used for radioactive 
sequencing with Sequenase were found to be 
suitable, with no need for modification in prepara- 
tion, for use in fluorescent sequencing with 
ABI automated sequencers. A potentially serious 
problem was that of GC compressions: since the 
~51% G+C content of E. coli leads to about one 
compression per kilobase, a frameshift error could, 
in principle, occur in every gene. Comparison of the various 
fluorescent based sequencing chemistries demon- 
strated that dye-terminator chemistry with dITP 
substituting for dGTP would minimize this 
problem. DNA templates were sequenced, using Taq 
polymerase, by thermal cycling reactions with 
fluorescent-labelled dideoxy terminators. Reactions 
were set up in 96-well plates, and unincorporated 
dye-labelled dideoxynucleotides removed from the 
reactions by passage through Sephadex G50 size 
exclusion resin in a 96-well fritted plate, which can 
be centrifuged. These reactions yielded an average 
of 500-550 bases of usable sequence per template 
from a 3- to 4-h run on an ABI model 377. 

A major problem faced by all groups sequencing 
E. coli DNA stems from the difficulty of subdividing 
the genome into appropriately sized fragments for 
sequencing. The theoretical advantages of sequenc- 
ing from nonoverlapping (or minimally overlap- 
ping) fragments larger than those provided 
by λ-clones are obvious. These include reduction 
in the number of shotgun subclones that need 
to be prepared and processed, elimination of 
plaque hybridization or other steps to screen out 
λ-vector sequences from the subclones, and increased 
overall efficiency through avoiding excess depth of 
coverage in regions of overlap between adjacent 
λ-clones. 

A new approach by C. Bloch, a collaborator at the 
University of Michigan, seems to have solved the 
problem of segmenting the genome for sequencing 
without overlap redundancy. Transposons have 
long been useful as delivery vectors for E. coli, and 
have been used to introduce particular restriction 
sites to selected locations on the E. coli chromosome; 
recently engineered mini-Tn10 derivatives make 
this easier [142,143]. I-SceI is an intron-encoded site- 
specific DNA endonuclease originating from yeast 
[144,145], for which no sites are present in the E. coli 
chromosome. Bloch’s group has incorporated an 
I-SceI cleavage site into these mini-Tn10 derivatives, 
thus providing a system for introducing unique 
cutting sites into E. coli chromosomal DNA. 

In order to use these sites to isolate DNA frag- 
ments, it is necessary to introduce them, pairwise, 
into the chromosome of MG1655, so that a specific 
fragment will be released by I-SceI digestion. To 
facilitate this, Bloch has inserted I-SceI elements 
into the MG1655 chromosome using three different 
mini-Tn10 constructs, each with a different drug- 
resistance marker (spectinomycin, kanamycin, or 
chloramphenicol) [146,177]. Once the insert posi- 
tions have been mapped, a suitable set of E. coli lines 
can be constructed for the sequencing project by using 
P1 transduction to combine appropriately spaced 
pairs of sites linked to distinguishing antibiotic 
resistance markers. 
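
The choice of site pairs can be pictured with a short Python sketch. Everything in it is an assumption made for illustration: the insert positions, the marker labels and the target fragment size range are invented, and the chromosome is treated as linear for simplicity, although the E. coli chromosome is circular. The sketch simply lists pairs of mapped inserts that carry different resistance markers and lie a suitable distance apart.

```python
from itertools import combinations

def candidate_pairs(inserts, min_kb, max_kb):
    """List insert pairs with different markers and a suitable spacing.

    inserts: (position_kb, marker) tuples for mapped I-SceI inserts.
    Returns (left_kb, right_kb, fragment_kb) tuples.
    """
    pairs = []
    for (pos_a, marker_a), (pos_b, marker_b) in combinations(sorted(inserts), 2):
        size = pos_b - pos_a
        # Different markers allow both inserts to be selected in one strain.
        if marker_a != marker_b and min_kb <= size <= max_kb:
            pairs.append((pos_a, pos_b, size))
    return pairs

# Hypothetical mapped inserts: position in kb and resistance marker.
mapped = [(100, "kan"), (350, "spc"), (360, "kan"), (600, "cam")]
pairs = candidate_pairs(mapped, min_kb=200, max_kb=300)
```

The pair at 100 and 360 kb is rejected although its spacing is suitable, because both inserts carry the same marker and could not be distinguished during P1 transduction.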

Fragments were generated by digestion, with 
I-SceI, of agarose-embedded genomic DNA origi- 
nating from the appropriate double-insert strain. 
Preparative CHEF gel electrophoresis separated the 
fragment from the remainder of the genome, and the 
DNA was recovered by β-agarase digestion. The 
purified DNA fragment was then randomly 
fragmented by nebulization, size-fractionated, end- 
repaired, and ligated into the SmaI-cut M13 Janus 
vector to provide a library of source clones for 
DNA sequencing template preparation. Fifty-four 
percent of the genome was sequenced using I-SceI 
fragments. 




31.8 Escherichia coli databases 


In contrast to other model organism genome 
projects, there is no single central database for E. coli 
genomic data. In part, this reflects the many years of 
pregenome era work and the diverse interests found 
within the E. coli community. Nevertheless, a 
number of databases are available which organize E. 
coli genetic, sequence, and biochemical data. Those 
of which we are aware are described below. A list is 
maintained by ECDC at 
http://susi.bio.uni-giessen.de/db_other.html. 


31.8.1 EcoMap, EcoSeq, and EcoGene 


EcoSeq is a nonoverlapping, that is nonredundant, 
E. coli DNA sequence collection which integrates 
information about genes, DNA and protein 
sequences. Vector sequences detected in GenBank/ 
EMBL/DDBJ entries have been removed, and 
adjacent or overlapping sequence entries melded to 
generate continuous sequences. EcoMap integrates 
EcoSeq with the genomic restriction map of Kohara 
et al. [54]. As additional DNA sequences are aligned 
with the restriction map, segments of the Kohara 
map are replaced with sequence-derived restriction 
maps. EcoGene contains information about iden- 
tified and putative protein- and RNA-encoding 
genes, and translations of sequences thought to 
encode proteins. These data are correlated and cross- 
referenced with the SWISS-PROT protein sequence 
database. 
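
The idea of a sequence-derived restriction map can be illustrated with a minimal Python sketch. It assumes a linear sequence, reports the start of each recognition site (not the exact cut position within it), and uses a made-up test sequence. The four recognition sequences shown are standard, but the selection of enzymes is illustrative and not a statement of the set used in the Kohara map.

```python
# Recognition sequences of four commonly used restriction enzymes.
SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "PstI": "CTGCAG",
}

def site_positions(sequence, site):
    """Return 0-based start positions of every occurrence of a site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

def restriction_map(sequence, sites=SITES):
    """Map each enzyme name to the positions of its sites in the sequence."""
    return {enzyme: site_positions(sequence, site) for enzyme, site in sites.items()}

def fragment_lengths(sequence_length, positions):
    """Approximate fragment sizes for a linear molecule cut at the sites."""
    bounds = [0] + sorted(positions) + [sequence_length]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Made-up 26-bp test sequence with two EcoRI sites and one BamHI site.
test_seq = "AAGAATTCTTGGATCCAAGAATTCAA"
test_map = restriction_map(test_seq)
ecori_fragments = fragment_lengths(len(test_seq), test_map["EcoRI"])
```

Comparing fragment sizes predicted in this way with those on an experimentally determined map is one way in which a newly sequenced segment could be aligned with, and then substituted into, the existing map.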

A hard copy of version 5 has been published [58], 
and version 6 has been described [59]. An update of 
the dataset for version 7 is in preparation and will be 
made available, along with documentation and 
programs to access the datasets, via anonymous ftp 
(K. Rudd, personal communication): 
ftp://ncbi.nlm.nih.gov/repository/Eco/. For additional 
information, contact Kenn Rudd 
(rudd@ecogene.med.miami.edu). 


31.8.2 The Escherichia coli database collection 


This database contains information for the entire 
E. coli K-12 chromosome, and is organized like a 
genetic map. The database can be searched for gene 
names or map positions. Coding sequences (CDS) 
are indicated for each gene — whether putative ORF 
or untranslated RNA. Regulatory regions, promot- 
ers, terminators, and IS elements are also indicated. 

A hard copy of Release 20 has been published [147]. 
The complete ECDC dataset is available by 
anonymous ftp (susi.bio.uni-giessen.de) or together 
with a Windows application on the EMBL (EBI) 
CD-ROM. It can also be queried via the World Wide Web 
with a forms-capable client: 
http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html. For 
additional information, contact Manfred Kröger 
(kroeger@embl-heidelberg.de). 


31.8.3 Colibri 


Colibri is a relational database dedicated to the 
analysis of the E. coli genome. It was developed as a 
part of a thesis project through a collaboration 
between the Unité de Régulation de l’Expression 
Génétique (Institut Pasteur-CNRS) and the Atelier 
de BioInformatique (Institut Curie). A complete 
description of the database and its organization has 
been published [148]. Colibri is a Macintosh appli- 
cation developed with the 4th Dimension database 
engine. Version 1.3 (30 October 1994) is available via 
anonymous ftp (ftp.pasteur.fr, in the directory 
pub/GenomeDB/Colibri); a new release will be 
available soon (A. Danchin, personal communi- 
cation). For additional information, contact Ivan 
Moszer (moszer@pasteur.fr) or Antoine Danchin 
(adanchin@pasteur.fr). 


31.8.4 The Escherichia coli Genetic Stock Center 


The E. coli Genetic Stock Center (CGSC) at Yale 
University maintains a database of E. coli genetic 
information, including genotypes and reference 
information for the several thousand strains in the 
CGSC collection, a gene list with map and gene 
product information, and information on specific 
mutations. An electronic version of the E. coli linkage 
map is also under development. 

A public version of the database includes the 
information of most interest to the community and is 
accessible via the WWW, with a forms interface for 
queries, at the URL: http://cgsc.biology.yale.edu/top.html 

The CGSC Web server is experimental and does 
not include all information available through the 
Sybase data base. Address questions about the data 
base contents or requests for stocks, information, or 
a guest login to Mary Berlyn 
(mary@cgsc.biology.yale.edu). 


31.8.5 The Encyclopedia of Escherichia coli Genes 
and Metabolism 


The Encyclopedia of E. coli Genes and Metabolism 
(EcoCyc) is a database integrating information 
about E. coli genes and metabolism [149]. A graphi- 
cal user interface creates drawings of metabolic 
pathways, of individual reactions, and of the E. coli 
genomic map. Users can call up objects through a 
variety of queries and then navigate to related 
objects shown in the display window. For example, a 
user could zoom in on a region of the genetic map, 
click on a gene to obtain detailed information about 
it, navigate to the enzyme product of the gene, and 
then view the metabolic pathway containing the 
enzyme. 

For information on EcoCyc, see the WWW 
URL: http://www.ai.sri.com/ecocyc/ecocyc.html 


31.8.6 The Escherichia coli Gene Database 


A compilation of E. coli genes and gene products, 
categorized by physiological function, the E. coli 
Gene Database (GenProtEC) also includes homology 
information for proteins similar to at least one other 
E. coli protein [82]. The database is available by 
ftp (ftp://hoh.mbl.edu/pub/ecoli.zip), and there is a 
WWW version at the URL: 
http://www.mbl.edu/~dspace/eco.html 


31.8.7 The Escherichia coli Gene-protein Database 


The E. coli Gene-protein Database (ECO2DBASE) is 
a database containing information about E. coli 
proteins obtained by the analysis of two-dimen- 
sional protein gels, and is maintained by F.C. 
Neidhardt. Full information and searching facilities 
are available at http://pcsf.med.umich.edu/eco2dbase. 

Questions and comments can be sent to Ruth 
VanBogelen (vanbogr@aa.wl.com) or Fred Neid- 
hardt (feneid@umich.edu). 


31.8.8 SWISS-PROT 


SWISS-PROT is a carefully curated and highly 
reliable protein sequence database which strives to 
provide a high level of annotations (such as the 
description of the function of a protein, its domain 
structure, post-translational modifications, variants, 
etc.), a minimal level of redundancy and high level 
of integration with other databases [150]. SWISS- 
PROT is not an E. coli-specific database, but its cross- 
references to other databases make it an exception- 
ally useful source of information. A. Bairoch has 
coordinated the E. coli entries with K. Rudd, and an 
index of all E. coli K-12 entries is available from: 
http://expasy.hcuge.ch/cgi-bin/lists?ecoli.txt. 
World Wide Web access through the ExPASy 
server allows forms-based searches of the entire data 
base by description or identification, accession 
number, author, or full text search: 
http://expasy.hcuge.ch/sprot/sprot-top.html 
For additional information, contact Amos Bairoch 
(bairoch@cmu.unige.ch). 

In addition to these electronic databases, sequence 
data have been used to produce compilations of 
a variety of functional sequence elements. These 
include promoters [151], terminators [152], and 
ribosomal binding sites [153]. 


31.9 Future prospects: what can we 
learn from sequencing prokaryotic 
genomes? 


In the race to determine the first complete sequence 
of a free-living organism, E. coli—once a prime 
contender—has lost out to five prokaryotes with 
smaller genomes and lower G:C contents: first to 
Haemophilus influenzae [154] and then to Mycoplasma 
genitalium [155], followed by the archaebacterium 
Methanococcus jannaschii [178], the cyanobacterium 
Synechocystis sp. Strain PCC6803 [158,179], and a 
second mycoplasma Mycoplasma pneumoniae [180]. 
The 13 Mb genomic sequence of the yeast 
Saccharomyces cerevisiae, undertaken by a large 
consortium of laboratories, was also completed 
sooner [181]. 

The sequences of many more genomes are in 
progress with some virtually completed; web sites 
are maintained for many of these. Prokaryote 
genome projects currently near completion in- 
clude those for the spore-forming soil bacterium 
Bacillus subtilis [156], the causative agent of 
leprosy, Mycobacterium leprae [157], Mycobacterium 
tuberculosis (two projects) and the archaeon 
Methanobacterium thermoautotrophicum (see web site 
at http://pandora.cric.com/htdocs/sequences/methanobacter/abstract.html). 
Under the auspices 
of its Microbial Genome Initiative, the United States 
Department of Energy (DOE) has launched a 
number of prokaryotic genome projects. An initial 
round of funding in 1994 supported projects to sequence 
the genomes of several archaebacteria [159]. M. 
jannaschii and M. thermoautotrophicum have been 
completed; Pyrococcus furiosus is still underway. In a 
second round of support, projects to sequence two 
extremophiles able to grow in boiling water, the 
archaebacterium Archaeoglobus fulgidus and the 
eubacterium Thermotoga maritima, were funded. 
Other possible candidate species for DOE genome- 
sequencing support include Sulfolobus solfataricus, 
which oxidizes sulfur, Clostridium acetobutylicum, of 
possible industrial use for alcohol production, and 
Pseudomonas aeruginosa, a particularly difficult-to- 
control hospital pathogen [160]. This is very much a 
partial list of species for which genome sequencing 
is in progress. For a tabulation of current 
prokaryotic/microbial genomes being sequenced, 
the reader is referred to the NCGR Microbial 
Genome Site (http://www.ncgr.org/microbe/) 
which attempts to keep track of current or com- 
pleted eubacterial, archaeal and eukaryotic genome 
sequencing projects. Over 50 projects were listed at 
the time the site was last examined (May 1997)! The 
Institute for Genomic Research also maintains this 
type of information at its web site. 

In addition to genomes that are the subject of 
concerted sequencing efforts, physical maps of 
many other genomes have been constructed [161]. 
Such maps could provide the basis for efforts to 
survey the sequences of those genomes, without 
actually attempting to determine complete genomic 
sequences. 

What can we look forward to as we enter a ‘post- 
sequencing’ era, with not just one but many versions 
of the so-called ‘blueprint of life’ available for 
examination (see ref. 183 for a discussion)? 


31.9.1 The E. coli K-12 story 


31.9.1.1 Identification of gene function 

E. coli K-12 genes can be divided into three classes on 
the basis of our current information about them. 
Between one-half and two-thirds [83] of gene 
products can be assigned definite functions, either 
on the basis of existing genetic or biochemical 
knowledge or because strong similarity to a well- 
characterized gene product in another organism 
permits confident prediction of function. A further 
16% of gene products show some sequence 
conservation, perhaps of a known motif, which 
provides a guidepost to function. For the remaining 
20% of predicted gene products there are no 
functional clues. While some apparently coding 
DNA may not normally be transcribed, what we 
currently know about the organization of bacterial 
genomes suggests that ‘junk DNA’ is rare. How will 
functions for the ~40% of genes which are still 
uncharacterized be deduced? 

First, by application of existing and accumulating 
biochemical and genetic information: the commu- 
nity of E. coli researchers possesses a considerable 
body of knowledge which is continuing to be 
correlated with accumulating sequence. Identifi- 
cation of functional motifs in new sequences can 
help in their assignment to already described 
functions. 

Second, by assigning new genes to existing 
regulons: for instance, a systematic approach to 
studying global regulatory mechanisms has been 
described for E. coli, in which mRNA levels 
expressed from various regions of the chromosome 
under different conditions are measured by hybri- 
dization to DNA derived from an ordered set of 
λ-clones [162]. The technique allows detection of both 
induced and repressed levels of gene expression, 
and is applicable to a variety of chemical, physical, 
or physiological treatments. For example, use of this 
method to examine gene expression under heat- 
shock conditions allowed the characterization of 26 
new heat-shock loci in E. coli [163]. Sequence studies 
can also be used for regulon assignment: identifi- 
cation of DNA-binding sites for specific regulators 
upstream of new ORFs can permit their inclusion in 
existing regulons. Identification of a new motif 
upstream of a group of genes can allow them to be 
classed as a previously unrecognized regulon. 

Third, knowledge of the positions and sizes of 
uncharacterized ORFs allows protein products to be 
sought (is the ORF transcribed/translated?) and the 
phenotypic consequences of insertional inactivation 
to be detailed (is the ORF dispensable? If not, what 
are the physiological consequences of inactivating 
it?). Such studies should also help to clarify what is 
actually regulated by genes classified as probable 
regulatory genes solely on the basis of sequence. 

Finally, the continually accumulating sequence 
databases for all organisms will continue to provide 
strong homologues for uncharacterized ORFs that 
will serve as important pointers to function. 


31.9.1.2 Comparisons with other E. coli strains: the 
power of genome scanning 

Sequencing information and resources can be used 
to quickly identify major sequence variations be- 
tween related strains which are associated with 
important phenotypic differences. Although E. coli 
K-12 is harmless, other E. coli strains cause disease. A 
35-kbp locus has been identified in enteropatho- 
genic and enterohaemorrhagic strains of E. coli, the 
presence of which is correlated with a specific 
histopathological effect on intestinal epithelial cells. 
Using mapping membranes of the Kohara E. coli K- 
12 clone set, and the sequence data from E. coli K-12, 
this locus was characterized as an insert relative to 
the K-12 genome. In uropathogenic E. coli strains, a 
different block of virulence genes is inserted at this 
same site, which is also the integration site for the 
E. coli retronphage phiR73 [164]. 

Similarly, a group of eight genes comprising the 
fimbrial gene cluster in pathogenic serotype b 
strains of Haemophilus influenzae was found to be an 
insertion relative to the genome of the nonpatho- 
genic Rd strain [154]. 

Genome scanning (single-pass sequencing of the 
entire genomes of prokaryotic strains of interest, 
followed by comparison to the more fully character- 
ized genomes of related strains/species) is likely to 
reveal other instances of such pathogenicity islands. 
Viewed in a more general manner as modules 
conferring new properties on a strain, the existence 
of these elements indicates that the old idea of a 
modular genome may well hold not only for 
bacteriophages [165], but for their hosts as well. One 
can predict that such modules will not be confined to 
pathogenicity determinants. Some of the gene 
clusters with atypical codon usage that have been 
described for E. coli K-12 and hypothesized to have 
been horizontally transmitted [85] may well owe 
their integration into the genome to mechanisms 
similar to those responsible for the acquisition of 
pathogenicity determinants. 


31.9.2 Evolutionary insights and 
other biological questions 


Prokaryotes (the Archaea and the Bacteria, the latter 
also referred to as Eubacteria) comprise two of the 
three superkingdoms of living organisms [166]. Full 
genomic sequences will soon be available for several 
representatives of each of these groups, and a full 
sequence is already available for the budding yeast 
Saccharomyces cerevisiae, a member of the Eucarya 
(eukaryotes), the third superkingdom. This will 
enable proteins to be divided into those that appear 
to have evolved relatively recently (conserved only 
in closely related genera) and those with much 
longer histories. Indeed, it has already been reported 
that 40% of E. coli proteins contain ancient conserved 
sequences, shared with Eucarya or Archaea [80], 
although it is not clear whether the extent of these 
similarities is sufficient to indicate evolutionary 
conservation of complete polypeptides as opposed 
to recruitment of useful motifs. To date, the con- 
struction of phylogenetic trees has mostly relied on 
comparison of rRNA sequences. The ability to 
construct trees based on sequence variation amongst 
a number of groups of homologous proteins should 
provide a powerful tool for phylogenetic analysis. In 
particular the relationship between archaebacteria 
and eubacteria and eukaryotes should become 
much clearer. 

A second question that analysis of genome 
sequences should help to answer is what constitutes 
a ‘minimal genome’. Is there a basic minimum set of 
genes required by any free-living organism and how 
big is it? Full analysis of the M. genitalium genome 
should help to answer this question. Possessing the 
smallest known genome of any free-living organism, 
this cell-wall deficient, fastidious prokaryote has a 
genetic complement of only 470 genes, perhaps 
10-15% as many as E. coli. Although normally it 
obtains much of its nutrition directly from its host, 
M. genitalium can also grow independently. Thus, we 
may ask whether its small genome defines a basic set 
of genes that is shared by all free-living bacteria. The 
wealth of genome sequences soon to be available 
should enable us to answer this question. 

If there prove to be core proteins common to all 
bacteria, and a population of horizontally transmis- 
sible gene clusters required by none, are there also 
unique genes which define a bacterial species? Or is 
a bacterial species simply a unique combination of 
genes, each of which is shared with some other 
species? It is to be hoped that comparative sequence 
studies will enable us to answer such questions. 


31.9.3 Practical considerations 


There are a number of clear practical benefits to be 
derived from functionally defining bacterial genes 
and we close with a short list of these. 

First, gene functions are much easier to define in 
E. coli than in humans. Identification of the functions 
of homologous bacterial genes has helped to define 
the functions of genes involved in inherited human 
diseases such as colon cancer susceptibility [1,2] and 
cystic fibrosis [167,168]. 

Second, there are many potential economic 
benefits to be derived from using bacteria, or their 
enzymes, in industrial processes. It is no accident 
that several of the organisms for which sequencing 
projects are underway are thermophiles, able to 
carry out biological processes at high temperatures, 
or originate from deep sea vents, where pressures 
are extreme. It is to be expected that enzymes of 
industrial importance will be identified in these 
organisms. In addition to enzymes as potential 
chemical catalysts in heavy industry, there are 
certain to be new uses in the research market, where 
high-temperature DNA polymerases have already 
engendered PCR and greatly improved DNA 
sequencing. Contributions to energy production 
(methane generation) and toxic waste degradation 
are also likely. 

Finally, let us not forget that bacteria cause 
disease, of both animals and plants. Even in the 
minimal M. genitalium genome, about 5% of the 
genes are devoted to evading the host’s immune 
system. Comparison across many species of homo- 
logous proteins likely to form targets for antibac- 
terial agents should help to identify conserved 
regions which could aid in targeted drug design. 
Comparison of virulent and avirulent strains should 
also identify surface proteins unique to the virulent 
organisms, providing a means of developing 
specific vaccines. 


CHAPTER 32 PROBLEMS IN PLANT GENOME ANALYSIS 


32.1 Introduction 


Knowledge of genome relationships within and 
between various plant taxa is of great use to many 
scientists, including cytogeneticists, plant breeders, 
taxonomists, evolutionists, molecular biologists and 
biotechnologists. Genome analysis provides useful 
information on chromosome pairing relationships 
and, hence, on the possibilities of transferring 
desirable traits between different species of higher 
plants. An understanding of genomic affinities helps 
to formulate effective breeding programmes design- 
ed to transfer desirable genes from wild relatives 
or primitive varieties of crop plants into otherwise 
superior cultivars. Many plants, and crop plants in 
particular, are polyploids, often with one or more 
sets of chromosomes derived from different sources; 
this means that the questions posed in plant genome 
analysis, and some of the problems encountered, 
are rather different from those pertaining to the 
analysis of animal genomes, especially those of 
humans and other mammals. 

Before discussing problems encountered in plant 
genome analysis, we must define a genome and the 
specific sense in which it will be used in this chapter. 
In general biological terms, a genome refers to all 
genetic material or the complete gene complement 
contained in a set of chromosomes in eukaryotes, or 
in the equivalent in prokaryotes. Thus, a genome 
consists of a single chromosome in bacteria, or a 
DNA or RNA molecule in viruses. In higher organ- 
isms, a genome represents one complete haploid 
set of chromosomes. Eukaryotic genomes, and 
many plant genomes in particular, are characterized 
by the occurrence of highly repetitive noncoding 
sequences. The evolution of higher plant species has 
often been accompanied by quantitative changes in 
this noncoding DNA fraction [1]. 

For this chapter, a plant genome will be defined as 
a complete basic set of chromosomes (denoted by 
the small letter x) inherited as a unit from one parent. 
Thus, in diploid higher plants, a genome refers to 
only one complete haploid (monoploid) set of 
chromosomes, called the haplome [2]. A diploid 
species, for example, diploid wheat (Triticum 
boeoticum Boiss.) has a double dose of a single 
genome designated by the capital letter A (so the 
diploid has the constitution AA). A polyploid plant, 
on the other hand, has several different genomes; for 
example, hexaploid common wheat (T. aestivum L.) 
has three different genomes, called A, B and D, each 
present in two copies in the somatic cells of the 
hexaploid (AABBDD). Thus, wheat is a hetero- 
genomic polyploid or allopolyploid. When a plant 
(e.g. alfalfa) has more than two doses of the same 
genome, it is called a homogenomic polyploid or an 
autopolyploid. 
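
The genome-formula notation defined above lends itself to a small worked example. The Python sketch below is only an illustration of the definitions: the function name is hypothetical, the formula AAAA is used as a stand-in for an autotetraploid and is not a designation taken from the text, and intermediate cases such as partially differentiated genomes are not handled.

```python
from collections import Counter

def classify_constitution(formula):
    """Classify a genome constitution written with one letter per genome.

    Returns (number of basic chromosome sets, category).
    """
    doses = Counter(formula)          # dose of each distinct genome
    sets = sum(doses.values())        # total number of basic sets (x)
    if sets == 1:
        category = "haploid"
    elif len(doses) > 1:
        category = "allopolyploid"    # heterogenomic: different genomes
    elif sets == 2:
        category = "diploid"
    else:
        category = "autopolyploid"    # homogenomic: >2 doses of one genome
    return sets, category

diploid_wheat = classify_constitution("AA")        # T. boeoticum
common_wheat = classify_constitution("AABBDD")     # T. aestivum
autotetraploid = classify_constitution("AAAA")     # hypothetical formula
```

The three calls reproduce the three cases described in the text: a diploid with a double dose of one genome, a hexaploid with three different genomes each present twice, and a plant with more than two doses of the same genome.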

Traditionally, in genome analysis a genome refers 
to the nuclear genome, unless stated otherwise. 
However, mitochondrial and plastid DNA (e.g. 
chloroplast DNA) have been used in phylogenetic 
investigations [3-5]. Chloroplast DNA analysis 
showed that the chloroplast genome of the primitive 
wheat T. timopheevii is equivalent to that of the wild 
grass Aegilops speltoides [6]. 

Genome analysis attempts to determine the 
genomic constitution of polyploid species and eluci- 
date genomic affinities among plant taxa. Various 
classical cytogenetic and biochemical techniques are 
used in studying genome relationships; these have 
more recently been supplemented with molecular 
methods. This chapter discusses some of the 
methods of genome analysis and the problems 
encountered in these studies, with the emphasis on 
the traditional cytogenetic approaches. I will discuss 
problems with particular reference to the cyto- 
genetic analysis of relationships within the grasses, 
the large and important group of monocotyledons to 
which many of our staple crop plants belong. 
Progress in mapping and analysing the genome 
of rice (Oryza sativa) by molecular methods is 
described in Chapter 34, while the mapping of the 
genome of the model dicot Arabidopsis thaliana is 
described in Chapter 33. 

Several techniques of genome analysis in plants 
have been employed over the years. These include: 

• crossability: reproductive isolation between taxa; 
• karyotypic analysis on conventionally stained 
somatic or pachytene chromosomes; 
• karyotypic analysis on Giemsa-banded somatic 
chromosomes; 
• chromosome pairing in hybrids at different 
ploidy levels; 
• chromosome pairing in hybrids in the presence of 
the Ph1 gene of wheat; 
• discrimination of pairing between parental 
chromosomes; 
• use of mathematical models on meiotic pairing 
data; 
• protein electrophoresis; 
• molecular tools of genome analysis. 

Each of these techniques has certain inherent 
advantages and disadvantages. The relative useful- 
ness of a technique depends upon the degree to 
which it measures, directly or indirectly, the similar- 
ity of the nuclear DNA of the related species. 
Genome analysis can be subjective at times and has 
certain limitations, as do most other biosystematic 
criteria. It is nevertheless one of the most useful 
methods for revealing phyletic relatedness. The 
merits and limitations of each of the criteria of 
genome analysis are discussed below. 


32.2 Crossability: reproductive 
isolation between taxa 


The phylogenetic relationship between two plant 
taxa is often indicated by the relative ease with 
which they can be hybridized. Crossability between 
two taxa and fertility of the resultant hybrid are 
useful criteria for elucidating genome relationships. 
These measures have long been used for estimating 
the degree of relationship between parental species 
[7]. When two taxa produce completely sterile 
hybrids and, hence, are reproductively isolated, it is 
probable that the taxa are genomically distinct. This 
technique has its own limitations, however. 

1 Barriers to crossing may limit the production of 
hybrids. Crossability may be under genetic control. 
For example, crossability of wheat (T. aestivum) and 
rye (Secale cereale) is controlled by two recessive 
genes, kr1 and kr2, in wheat [8]. Therefore, the 
success or failure of hybridization of two taxa may 
be influenced by the particular genotypes used. The 
crossability genes may act differently on different 
crosses; kr1 and kr2 genes prevent hybridization of 
wheat with its close relative rye but not with the 
remotely related maize [9]. The degree of fertility of 
a hybrid may also be influenced by the genotypes of 
the parents [10]. 

2 A cross may be successful only in one direction 
and its reciprocal may invariably fail, indicating the 
influence of cytoplasm on crossability, for example, 
in the legume genus Glycine [11]. Therefore, a hybrid 
studied only in one direction may produce erro- 
neous results. 

3 In some cases, two closely related taxa may cross 
very easily but the hybrids are completely sterile. 
For example, geographically isolated ecotypes of 
hexaploid tall fescue (Festuca arundinacea Schreb.) 
cross very readily but produce sterile hybrids. 
Interaction between parental genotypes results in 
the inactivation of the regulatory mechanism that 
controls diploid-like chromosome pairing, resulting 
in high homoeologous pairing and hence in sterility 
[12]. (Genetically and evolutionarily related chro- 
mosomes from different genomes within a hetero- 
genomic polyploid or from related species are 
known as homoeologous chromosomes, which are 
capable of pairing among themselves.) This is a 
novel mechanism for the creation of reproductive 
isolation barriers between infraspecific categories 
within a species. Sterility in such intervarietal 
crosses obviously cannot be used for assessing 
genome relationships. 


32.3 Karyotypic analysis on 
conventionally stained somatic or 
pachytene chromosomes 


In simple terms, karyotype is defined as the 
morphology of chromosomes. In plants, the karyo- 
type is generally studied at somatic metaphase in 
actively dividing root tips (or sometimes in shoot 
tips), in pollen mitosis, or at pachytene of meiosis. 
Chromosome number, total chromosome lengths, 
arm ratios, secondary constrictions and satellited 
regions or nucleolar organizer regions (NORs) con- 
stitute important parameters for karyotypic analy- 
sis. They have aided genome analysis and, hence, 
phylogenetic investigations. 
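
As a rough illustration of how such parameters are derived from arm measurements, the following Python sketch computes total length, arm ratio, centromeric index and relative length for a small complement. The `Chromosome` class, the `relative_lengths` function and the measurements are hypothetical. The long/short arm-ratio convention is an assumption, since no convention is specified here.

```python
from dataclasses import dataclass


@dataclass
class Chromosome:
    label: str
    short_arm: float  # hypothetical measurement, e.g. in micrometres
    long_arm: float

    @property
    def total_length(self) -> float:
        return self.short_arm + self.long_arm

    @property
    def arm_ratio(self) -> float:
        # Assumed convention: long arm divided by short arm.
        if self.short_arm == 0:
            return float("inf")  # telocentric: no measurable short arm
        return self.long_arm / self.short_arm

    @property
    def centromeric_index(self) -> float:
        # Short arm as a percentage of total chromosome length.
        return 100.0 * self.short_arm / self.total_length


def relative_lengths(chromosomes):
    """Each chromosome's length as a percentage of the whole complement."""
    complement_length = sum(c.total_length for c in chromosomes)
    return {c.label: 100.0 * c.total_length / complement_length
            for c in chromosomes}


# Invented measurements for a three-chromosome haploid complement.
karyotype = [
    Chromosome("1", short_arm=4.0, long_arm=6.0),
    Chromosome("2", short_arm=3.0, long_arm=6.0),
    Chromosome("3", short_arm=2.0, long_arm=4.0),
]

for c in karyotype:
    print(c.label, c.total_length, round(c.arm_ratio, 2),
          round(c.centromeric_index, 1))
print(relative_lengths(karyotype))
```

Relative length is used here rather than absolute length because absolute measurements depend on the degree of chromosome condensation in each spread.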


32.3.1 Uses of karyotype analysis 


Karyotype analysis has been used to study genome 
relationships in several plant groups, particularly 
the grasses. Avdulov [13] was among the first to 
use cytological features to establish evolutionary 
relationships among species and genera. Using cyto- 
logical criteria, he attempted a phylogenetic sub- 
division of the grasses and his publication entitled 
Cytotaxonomic Investigations in the Family Gramineae 
marked the beginning of a new era in grass classifi- 
cation. Remarkably, Avdulov’s classification was 
borne out by studies based on anatomy and geo- 
graphical distribution. 

The early studies on karyotype analysis used 
reconstructions prepared from serial sections of root 
tips and anthers and were therefore tedious and 
time consuming. However, the advent of squash 
techniques speeded up work on chromosome karyo- 
typing, which has been carried out for numerous 
plant groups including the wheat group. Thus, 
based on the similarity of the satellite chromosomes 
of the diploid species Aegilops speltoides (Tausch) to 
those of polyploid wheats, Riley et al. [14] inferred 
that this diploid was the source of the B genome of 
polyploid wheats. This inference has since been 
challenged because karyotypic and pairing data 
show another diploid in the wheat group, T. searsii
Feldman & Kislev, as a more likely source of the B 
genome [15]. This point remains controversial. 
Nevertheless, the B genome of polyploid wheats is 
very similar to the genome of Ae. speltoides. 


32.3.2 Problems of using karyotypic features in 
genome analysis 


Karyotypic data obtained from conventionally 
stained (e.g. acetocarmine, acetoorcein, or Feulgen-
stained) condensed chromosomes have a limited
usefulness for assessing genomic or phylogenetic
relationships. There are inherent inaccuracies in 
measurements taken on condensed chromosomes 
at mitotic metaphase. Computer-aided karyotypic 
analyses [16] may facilitate precise measurements of 
chromosomes. Fukui [17] developed a chromosome 
image analysing system, CHIAS, especially for plant 
chromosomes. CHIAS offers two main advantages 
over conventional karyotyping techniques. 

1 It enables the analysis of a large number of chro- 
mosome spreads in a relatively short time; it is pos- 
sible, for example, to analyse more than 250 barley 
metaphase plates within a week by semiautomatic 
processing of the chromosome images [18]. 

2 Results obtained are accurate and reproducible. 

The fact that somatic chromosomes from related 
species are similar in total size or arm ratio does not 
necessarily mean that they are similar in gene con- 
tent, however. Similar genomes can certainly have 
dissimilar karyotypes and vice versa. Moreover, it is 
not generally easy to identify unequivocally differ- 
ent chromosomes or their homoeologous partners. 
Somatic karyotyping can therefore sometimes lead 
to erroneous conclusions [19-21]. 

The satellited (SAT) chromosomes are relatively 
easy to identify, and serve as useful karyotypic 
landmarks. Chennaveeraiah [22] found that chro- 
mosome identification in the wheat group, based on 
chromosome size and arm ratio, was difficult, 
although in many cases the SAT chromosomes could 
be identified unambiguously. However, in inter- 
specific or intergeneric hybrids, amphiplasty can 
mask the visual expression of the NOR(s) of one of 
the parental species [23]. (Amphiplasty is the 
phenomenon of the masking of the NOR(s) of one 
species by those of the other in their hybrids.) The 
usefulness of karyotypic features in elucidating 
genomic relationships is therefore severely limited. 

Mitotic karyotyping is of particularly limited 
value in species with small chromosomes, where it 
may be difficult even to sort out homoeologous 
members of the complement. Although this problem 
may be partly obviated by karyotyping at pollen 
mitosis (where the haploid complement is repre- 
sented) or even at somatic metaphase in root tips of 
haploid plants, mitotic karyotyping has a limited 
usefulness. The pachytene stage of meiosis could 
prove relatively more useful for precise karyotypic 
analysis. In addition to accurate measurements of 
chromosomes, it permits a precise study of chro- 
momere patterns and other finer details of the 
already paired chromosomes. Thus, one has to 
measure only the gametic (n) number of chromo- 
somes. Generally, plant taxa with low chromosome 
number are suitable for pachytene karyotyping.
Using this technique, Singh and Hymowitz [24]
were nevertheless able to study genome relation- 
ships between species of Glycine with high chromo- 
some number. 


32.4 Karyotypic analysis on Giemsa- 
banded somatic chromosomes 


Progress in different areas of cytogenetics depends 
on our ability to identify not only individual 
chromosomes but also parts of chromosomes. Some 
plants have similar chromosomes which cannot be 
distinguished by conventional staining. The advent 
of Giemsa banding techniques for mammalian 
chromosomes [25-27] provided cytogeneticists with 
a powerful tool for karyotypic analysis and chromo- 
some mapping (see Chapter 7). These banding tech- 
niques differentially stain chromosome regions rich 
in constitutive heterochromatin that contains a 
large amount of highly repetitive DNA. This 
produces a unique pattern of dark and light bands 
along the length of a chromosome, which can be 
used in chromosome identification. Some of these 
banding techniques have been successfully used for 
karyotyping plant chromosomes [28-35]. 


32.4.1 Advantages of chromosome banding 


Both C- and N-banding techniques have been used 
to identify individual chromosomes of numerous 
plant species. Giemsa N-banding was originally 
developed for differential staining of the NOR in 
both plant and animal chromosomes. However, this 
method was also found useful for detecting con- 
stitutive heterochromatin and hence for identifi- 
cation of specific chromosomes in cereals [32]. 
Giemsa C-banding gives the best resolution of 
chromosome-specific bands and has allowed the 
identification of all the 21 (A-, B- and D-genome) 
chromosome pairs and most chromosome arms in 
hexaploid wheat [35] and chromosomes of wild 
grass species in their hybrids with wheat (Fig. 32.1). 
The value of chromosome banding in genome 
analysis has been discussed by Friebe and Gill [36]. 
On the basis of the banding pattern, a particular 
genome in a polyploid crop plant may be traced 
back to a putative diploid progenitor. Thus, C- 
banding in Ae. squarrosa L. (=T. tauschii (Coss.) 
Schmalh) gives a pattern very similar to that of the 
D-genome chromosomes in hexaploid wheat [37]. 
To get good resolution of major and minor bands, 
C- and N-banding analysis should be done on 
relatively less condensed (prometaphase) chromo- 
somes (see Chapter 7). 

Chromosome banding has also proved useful in
identifying chromosome segments involved in
translocations. Using C-banding, Gill and Kimber
[38] demonstrated translocations not only between
two different wheat chromosomes but also between
wheat and rye chromosomes. Similarly, reciprocal
translocations involving chromosomes 6A and 1G
were revealed in T. timopheevii (2n = 4x = 28; AAGG)
[39]. Thus, C-banding could also be useful in
studying restructured genomes.

Fig. 32.1 C-banded somatic chromosomes of
trigeneric hybrids (2n = 4x = 28; ABJE genomes).
The hybrids were made between durum wheat
(T. turgidum; 2n = 4x = 28; AABB), Thinopyrum
bessarabicum (2n = 2x = 14; JJ), and Lophopyrum
elongatum (2n = 2x = 14; EE). Note that each of the
28 chromosomes can be identified on the basis of
its diagnostic banding pattern. For the sake of
convenience, chromosomes of each genome are
numbered 1-7. The numbering of chromosomes of
the J and E genomes does not reflect their
homoeology with corresponding chromosomes of
the A and B genomes of durum wheat. From ref. 64.


32.4.2 Problems in banding analysis 


The usefulness of a banding technique depends 
upon the uniqueness of the banding patterns 
created. There are several variables in a C-banding 
protocol and it may not always be possible to obtain 
consistent and repeatable results. The interpretation 
of banding patterns is sometimes difficult, espec- 
ially when dealing with small structural changes. 
Moreover, the banding pattern may vary among 
members of the same species [37,40,41]. Such a 
banding polymorphism could complicate banding 
and hence genome analysis in a particular plant
group. Variation in C-banding patterns between
homologous chromosomes within and between
different plants and cultivars has also been reported 
in cross-pollinating rye (S. cereale L.) [42-44] and in 
barley [45]. Therefore, a ‘chromosomal passport’ 
based on a typical banded karyotype cannot always 
be prepared for a particular plant species. Such 
problems will tend to limit the utility of banding 
techniques. 


Although barley (Hordeum vulgare L.) has a low 
chromosome number (2n = 14) and relatively large
chromosomes compared to other plant species, 
problems have been encountered in its karyotypic 
analysis. It has a symmetrical karyotype, the mor- 
phology of the chromosomes being similar except 
for chromosomes 5, 6, and 7. It was therefore diffi- 
cult to distinguish every chromosome until the 
C-banding technique [46] was used. However, even 
then, differences of opinion persisted among 
cytogeneticists. Linde-Laursen’s [46] assignment of 
the short and long arm of chromosome 1 based on 
C-banding was revised by Noda and Kasha [47], 
and the revision accepted by Linde-Laursen [45]. 
However, when Singh and Tsuchiya [48] studied the 
chromosomes of barley by Giemsa N-banding they 
adopted Linde-Laursen’s original short and long 
arm assignment for chromosome 1. This example 
well illustrates the difficulty encountered in plant 
chromosome studies. 


32.5 Chromosome pairing in hybrids 
at different ploidy levels 


The principal criterion for assessing genomic affini- 
ties between species has been and still is the study of 
chromosome pairing in their hybrids at different 
ploidy levels. The degree of chiasmate pairing be- 
tween parental chromosomes is generally a reliable 
indicator of the degree of genomic relatedness. 
However, the situations under which chromosome 
pairing occurs must also be examined.


32.5.1 Diploid hybrids: their limitations


Hybridization between two diploid species—for 
example, AA and BB—results in diploid hybrids 
with AB genomes. Chromosome pairing in such 
hybrids may suggest a degree of relationship be- 
tween the parental genomes. However, because 
diploid hybrids have only two sets of chromosomes, 
they lack conditions for preferential pairing. The
chromosomes of one genome (A) have the option of
pairing only with those of the other genome (B).
Chromosome pairing in diploid hybrids therefore
does not provide sound information on genomic
affinity. For this reason, many
cytogeneticists do not rely on chromosome pairing 
in diploid hybrids as a means of genome analysis. 
Kimber and Feldman [49] contend that hybrids 
between diploid species are ‘essentially useless’ for 
genome analysis, and we take an essentially similar 
position [21,50,51]. 

However, the degree of sterility in diploid hybrids 
may yield useful information on the relationship of 
the constituent genomes. If the hybrids are com- 
pletely sterile, the two genomes in them may be 
considered to be differentiated. 


32.5.2 Triploid hybrids: a realistic test of 
intergenomic affinity 


To obviate the limitations of chromosome pairing 
encountered in diploid hybrids, autoallotriploid 
hybrids (e.g. AAB or ABB) derived by crossing a 
synthetic autotetraploid of one species (e.g. AAAA 
or BBBB) with a diploid of the other (e.g. BB or AA) 
should be studied. (An autoallotriploid hybrid has 
two doses of one genome and one of the other, e.g. 
AAB or ABB. Thus, AAB is auto with respect to 
genome A, and allo with reference to the two 
genomes.) The rationale of using such autoal- 
lotriploids is to create conditions for preferential 
pairing among chromosomes of different genomes 
to ascertain their relative affinities. If the chromo- 
somes of the duplicated genome pair preferentially 
as bivalents and the chromosomes of the single 
genome remain largely unpaired, the two genomes 
may be considered to be distinct. However, if 
chromosomes of the single genome offer synaptic 
competition to the homologous chromosomes of the 
duplicated genome, resulting in a high frequency of 
trivalents, the two genomes could be considered to 
be closely related. For example, on the basis of 
chromosome pairing in various combinations of 
autoallotriploids among Italian ryegrass (Lolium 
multiflorum Lam.), perennial ryegrass (L. perenne L.), 
and meadow fescue (Festuca pratensis Huds.), we
concluded that there is little structural
differentiation among the chromosomes of the three
species
and, thus, no effective isolation barrier to gene flow 
from one species to another [52]. However, studies 
on autoallotriploids (JJE) between the grasses Thino- 
pyrum bessarabicum (JJ) and Lophopyrum elongatum 
(EE) showed that J and E are distinct genomes [50]. 
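
The reasoning above can be made concrete with a minimal Python sketch that summarizes meiotic configurations scored in an autoallotriploid. Each cell is recorded as counts of univalents (I), bivalents (II) and trivalents (III). The function names and the scored cells are hypothetical, and the sketch assumes a triploid with 2n = 3x = 21 in which no higher-order associations are formed. It reports averages only and applies no threshold for calling genomes related or distinct.

```python
def summarize_cells(cells, chromosome_number):
    """Return mean numbers of I, II and III per cell.

    cells: list of (univalents, bivalents, trivalents) tuples,
    one tuple per pollen mother cell.
    """
    if not cells:
        raise ValueError("no cells scored")
    for univalents, bivalents, trivalents in cells:
        # Every chromosome must be accounted for in each cell.
        if univalents + 2 * bivalents + 3 * trivalents != chromosome_number:
            raise ValueError("configuration does not add up to 2n")
    n_cells = len(cells)
    return tuple(sum(cell[k] for cell in cells) / n_cells for k in range(3))


def trivalent_fraction(mean_trivalents, base_number):
    """Mean trivalents per cell as a fraction of the maximum possible (x)."""
    return mean_trivalents / base_number


# Invented scores for three cells of a 2n = 3x = 21 hybrid (x = 7).
scored = [(7, 7, 0), (5, 5, 2), (6, 6, 1)]
mean_i, mean_ii, mean_iii = summarize_cells(scored, chromosome_number=21)
print(mean_i, mean_ii, mean_iii)
print(round(trivalent_fraction(mean_iii, base_number=7), 3))
```

A value near zero corresponds to the bivalent-plus-univalent pattern expected of distinct genomes, and a value approaching one corresponds to the high trivalent frequency expected of closely related genomes.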

For studying genome relationships on the basis of 
chromosome pairing in autoallotriploid hybrids, the 
mode of synthesis of these hybrids is critical. To 
obtain realistic data, autoallotriploids should be 
synthesized by crossing a synthetic autotetraploid 
of one species, produced through the use of a 
chromosome doubling agent, with the diploid of the 
other. Thus, to synthesize JJE hybrids, synthetic JJJJ 
autotetraploid should be crossed with the diploid 
EE. Similarly, to obtain JEE hybrids, synthetic EEEE 
should be hybridized with the diploid JJ. The 
synthetic autoallotriploids thus synthesized will 
have pure genomes, unadulterated by the homo- 
eologous pairing and recombination that are charac- 
teristic of naturally occurring polyploids. Crosses of 
diploid species with naturally occurring tetraploids 
have been studied [53] but this approach may have 
problems since the naturally occurring tetraploids 
may have undergone homoeologous recombina- 
tions between the constituent genomes, thus alter- 
ing the pattern of chromosome pairing in hybrids 
with their ancestral diploids. Naturally occurring 
tetraploids may have developed a genetic control of 
chromosome pairing [12,54,55] and therefore the 
pairing patterns in synthetic hybrids involving such 
tetraploids may be different from those in hybrids 
derived by crossing a diploid species with a syn- 
thetic autotetraploid. 


32.5.3 Amphidiploids: opportunity for 
preferential pairing 


Amphidiploids (4x) are obtained by chromosome 
doubling of diploid F1 interspecific or intergeneric 
hybrids (2x). The rationale for studying chromo- 
some pairing in amphidiploids is the same as for 
autoallotriploids described above. If each of the two 
genomes is represented twice (for example, AABB), 
the genomes should maintain their meiotic integrity 
and pairing should be limited to homologous 
partners if the genomes are distinct. In other words, 
preferential pairing will lead to diploid-like meiosis 
and, hence, high fertility. However, if the genomes in 
question are closely related, the amphidiploids will 
form quadrivalents in addition to bivalents, and 
have varying degrees of sterility. 

The fertility of an amphidiploid depends upon the 
degree of differentiation of the constituent genomes,
which is evidenced by the degree of pairing in the
parental diploid hybrid. There is generally a close 
inverse relationship between 2x pairing and 4x 
fertility [56], and between 2x chiasma frequency and 
4x bivalent frequency [57]. In other words, the 
greater the divergence between the two genomes 
making up a hybrid, the greater the fertility in the 
derived amphidiploids. Thus, we have observed 
that a low chiasma frequency in the diploid hybrid
(JE) Thinopyrum bessarabicum × Lophopyrum
elongatum yielded a preponderance of bivalents in its
amphidiploids (JJEE), which therefore were meioti- 
cally and reproductively stable and fertile [21,50]. 
These studies show the divergence of the J and E 
genomes. 


32.5.4 Other limitations of chromosome pairing 


In addition to the limitations of chromosome pairing 
discussed above, one may encounter some other 
problems. 


32.5.4.1 Effect of environmental factors 

The amount of pairing and chiasma frequency can 
be influenced by environmental factors such as 
temperature [58]. Therefore, hybrids being studied 
or compared should be grown under normal, 
uniform conditions. If the hybrids being compared 
are grown under very different environments, 
chromosome pairing data obtained from them may 
lead to erroneous conclusions on genomic affinities. 


32.5.4.2 Effect of genotype 

Chromosome pairing promoters and inhibitors [54] 
can influence the quantity of pairing and thus alter 
the pattern of affinity among genomes combined in 
hybrids. The amount of pairing in hybrids may be 
influenced by genotypes of the parents. Genotypes 
of some wild grasses when crossed with wheat 
suppress, in the hybrids, the activity of the homo- 
eologous pairing suppressor gene, Ph1, and thus 
bring about unexpectedly high homoeologous 
pairing [59,60] (see also Section 32.6.2). Such hybrids 
do not offer sound conditions for interpreting 
genomic relationships. 


32.5.4.3 Asynaptic or desynaptic phenomena 

Asynapsis (total lack of chromosome pairing) and 
desynapsis (precocious separation of initially paired 
chromosomes), although rare, are generally under 
genetic control. Several major genes which, in the 
homozygous recessive condition, bring about fail- 
ure or disruption of pairing are known [61]. These 
pairing variants may also be caused by environ-
mental factors. They may prove to be impediments
to genome analysis because lack of pairing in the
asynaptic/desynaptic hybrids may be misinter- 
preted as being due to lack of pairing affinity 
between the parental chromosomes. An experienced 
cytogeneticist should, however, not be misled or 
biased by such rare situations because they can be 
recognized at diakinesis or early metaphase I of 
meiosis. 


32.5.4.4 Possible cytoplasmic effects on
chromosome pairing

There are some indications of cytoplasmic effects on
chromosome pairing in hybrids (ref. 55 and G.
Peterson and O. Seberg, personal communication
1994). In such cases, the amount of pairing in a hybrid
would depend upon whether a particular genotype is
used as a female or male parent. Failure to study
reciprocal hybrids may produce erroneous results.

Despite all these limitations, genome analysis by
studying chromosome pairing has yielded extreme- 
ly useful information. In the tribe Triticeae, for 
example, this technique has been successfully 
employed for almost six decades and has revealed 
phylogenetic relationships that have been borne out 
by other criteria. 


32.6 Chromosome pairing in hybrids 
in the presence of Ph1


Common wheat (T. aestivum) is an allohexaploid 
with three genomes, AA, BB and DD, derived from 
its three wild ancestors. Although the corresponding 
chromosomes of the three genomes are closely 
related, a gene called Ph1 (located in the long arm of 
chromosome 5B) suppresses pairing between chro- 
mosomes of different genomes [54,62]. In other 
words, Ph1 suppresses homoeologous pairing so 
that only homologous chromosomes can pair. This 
gene also suppresses pairing between homoeolo- 
gous chromosomes of alien genomes introduced 
into wheat. 


32.6.1 Advantages of Ph1-regulated 
chromosome pairing 


Pairing or lack of pairing between chromosomes of 
two genomes in the presence of Ph1 in the wheat
background would provide a crucial test of their 
relationship. If chromosomes of two putatively 
related genomes pair with each other in the presence 
of Ph1, thereby passing its limits of discrimination, 
then they may be considered to be essentially 
homologous. Thus, the presence of Ph1 provides 
a rigorous test of homology and may facilitate 
assessment of genome relationships. This approach
has been applied to the study of the relationship
between the J genome of Thinopyrum bessarabicum 
and the E genome of Lophopyrum elongatum. Forster 
and Miller [63] studied chromosome pairing in the 
AABBDDJE hybrids derived by crossing wheat-Th. 
bessarabicum amphidiploids (AABBDDJJ) with 
wheat-L. elongatum amphidiploids (AABBDDEE). 
The AABBDDJE hybrids showed mostly 21 
bivalents (II) + 14 univalents (I). Clearly, the J- and E- 
genome chromosomes do not pair in the presence of 
Ph1. Essentially similar results were obtained in 
trigeneric hybrids (ABJE) involving durum wheat 
(T. turgidum, AABB), Th. bessarabicum (JJ), and L. 
elongatum (EE) [21,64]. It was therefore concluded 
that the chromosomes of J and E genomes are 
homoeologous at best [20,21]. 
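
The figure of 21 II + 14 I follows from simple counting, which the Python sketch below reproduces. It assumes that each genome letter stands for one set of x = 7 chromosomes and that pairing is strictly homologous, so that only chromosomes of a genome present in two doses form bivalents. The function name is hypothetical, and treating doses above two as pairing in bivalents is a simplifying assumption.

```python
from collections import Counter


def expected_configuration(genome_formula, base_number=7):
    """Expected (bivalents, univalents) under strictly homologous pairing.

    genome_formula: string with one letter per genome dose, e.g. "AABBDDJE".
    base_number: chromosomes per genome (x = 7 is assumed here).
    """
    doses = Counter(genome_formula)
    # Each pair of doses contributes x bivalents; an odd dose leaves
    # x chromosomes without a homologous partner.
    bivalents = sum((dose // 2) * base_number for dose in doses.values())
    univalents = sum((dose % 2) * base_number for dose in doses.values())
    return bivalents, univalents


for formula in ("AABBDDJE", "ABJE", "JJE"):
    bivalents, univalents = expected_configuration(formula)
    print(f"{formula}: {bivalents} II + {univalents} I")
```

Observed pairing above this baseline, for example bivalents in an ABJE hybrid, would point to homoeologous rather than homologous associations.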


32.6.2 Limitations of Ph1-regulated 
chromosome pairing 


A limitation of Ph1-regulated pairing is that
sometimes, in the presence of more than one dose of
Ph1, even homologous chromosomes present in
more than two copies may pair as bivalents. This is 
exemplified by the formation of 22 bivalents in a 
large proportion of pollen mother cells (PMCs) 
in wheat tetrasomics (2n=44), which have four 
copies of one chromosome. Because of the genetic 
control of diploid-like pairing, hexaploid tall fescue 
(2n =6x =42; AABBCC) forms 21 II [65]. However, in 
the colchicine-induced dodecaploid tall fescue 
(2n=12x=84; AAAABBBBCCCC), which has four 
doses of each of the chromosomes, a preponderance 
of bivalents is observed instead of the expected 
quadrivalents [66]. It seems that the pairing control 
genes in four doses not only suppress homoe- 
ologous pairing, but also bring about bivalent 
pairing of the homologous sets. These are, never- 
theless, exceptional situations and should not 
deter one from using the Ph-regulated pairing (e.g. 
in the wheat background) to decipher genome 
relationships. 

Another limitation of the use of this tool is that the 
activity of Ph1 can be suppressed in the hybrids by 
the genotype of the parental species. Ae. speltoides, 
for example, is known to suppress the activity of Ph1 
in its hybrids with wheat, resulting in high homoe- 
ologous pairing [67]. Similarly, certain genotypes of 
Agropyron cristatum (L.) Gaertner switch off or 
reduce the activity of Ph1 in hybrids with wheat and 
thus bring about pairing among homoeologous 
chromosomes [59,60]. Chromosome pairing in such 
hybrids may not be used for deducing genome 
relationships because even relatively less related 
chromosomes pair in the absence of Ph1. 


32.7 Discrimination of pairing 
between parental chromosomes 


Chromosome pairing in hybrids may be due to 
autosyndesis, that is pairing within a parental com- 
plement, and/or allosyndesis, that is pairing be- 
tween the parental chromosomes. Therefore, for 
assessing genome relationships, the nature of chro- 
mosome pairing should also be analysed because 
only the degree of pairing between parental chromo- 
somes is indicative of the degree of relationship 
between parental species. The following criteria 
may help study specific pairing. 


32.7.1 Size difference between 
parental chromosomes 


A distinct size difference between chromosomes of 
parental diploid species can help estimate, in their 
diploid hybrids, the degree of autosyndetic (intrage- 
nomic) and allosyndetic (intergenomic) pairing. The 
intergenomic bivalents are generally heteromor- 
phic, whereas intragenomic bivalents are relatively 
homomorphic, either large or small depending 
upon the size of parental chromosomes [68,69]. We 
[68] studied both auto- and allosyndetic pairing in 
interspecific hybrids between Pennisetum typhoides 
Stapf et Hubb. (2n=14 large chromosomes) and P. 
purpureum Schum. (2n=28 relatively smaller 
chromosomes) and deduced genomic relationships. 
The nature of pairing was also studied in hybrids 
between P. typhoides and P. orientale Rich. (2n=18 
relatively smaller chromosomes) [69]. A certain 
degree of error may be involved in discriminating 
the parental chromosomes in such hybrids.


32.7.2 Use of marked chromosomes 


Cytologically marked and hence easily distinguish- 
able chromosomes may be used to study pairing 
between specific chromosomes. Thus, a telocentric 
chromosome is readily recognizable in both somatic 
and meiotic plates and provides a ready marker to 
study pairing behaviour of particular chromosomes. 
Using double-double telocentric wheat, in which 
two homoeologous chromosomes were marked by 
their two separate telocentric arms, Alonso and 
Kimber [70] studied intergenomic chromosome 
pairing in wheat x Aegilops hybrids (ABDS). They 
found that the pairing frequencies of the B and S 
genomes were similar to those of the A and D. 

This strategy of studying pairing between specific 
chromosomes may have certain limitations. A
telocentric chromosome lacks one arm, and
recombination is strongly reduced near the
centromere, so the pairing potential of the entire
chromosome will seldom be realized [71].


32.7.3 Meiotic chromosome banding 


Distinct banding patterns of parental chromosomes 
are very helpful in studying pairing specificity 
between them in the hybrids of allopolyploid 
parental species. Meiotic C-banding of wheat and 
rye hybrids suggested pairing between chromo- 
somes of the A and D genomes [72-74]. We have 
found that N-banding analysis of meiosis in
Ph1-deficient wheat euhaploids (2n = 3x = 21; ABD)
offers an excellent means of elucidating pairing
relationships among chromosomes of the A, B and D
genomes without the competitive (or genetic) inter- 
ference of an alien genome [62]. Chromosome arms 
involved in rod and ring bivalents were identified. 
Various chromosomes involved in bivalent and 
trivalent configurations were also identified. We 
showed that almost 80% of the metaphase I 
associations were between chromosomes of A and D 
genomes, indicating that these genomes are more 
closely related to each other than either one is to the 
B genome. 
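
Once the chromosomes in each association have been assigned to genomes by their banding patterns, the proportions can be tallied as in the following Python sketch. The function name and the ten scored associations are invented for illustration and are not the data of ref. 62. They were chosen only so that the A-D share matches the approximate 80% quoted above.

```python
from collections import Counter


def association_shares(associations):
    """Proportion of metaphase I associations per genome pair.

    associations: iterable of (genome, genome) tuples, one per association.
    """
    # Sort each pair so that ("D", "A") and ("A", "D") are counted together.
    pairs = Counter(tuple(sorted(pair)) for pair in associations)
    total = sum(pairs.values())
    if total == 0:
        raise ValueError("no associations scored")
    return {"-".join(pair): count / total for pair, count in pairs.items()}


# Hypothetical scores: 8 A-D, 1 A-B and 1 B-D association.
observed = [("A", "D")] * 8 + [("B", "A")] + [("D", "B")]
shares = association_shares(observed)
print(shares)
```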


32.8 Application of mathematical 
models to meiotic pairing data 


Giemsa banding of meiotic chromosomes may 
prove very useful in assessing the degree of pairing 
relationship between chromosomes of specific 
genomes. However, in the absence of banded or 
otherwise marked chromosomes, an acceptable— 
although relatively less reliable—substitute is the 
fitting of numerical models to the observed meiotic 
associations. The relative pairing affinities among 
chromosomes of the constituent genomes in poly- 
ploid hybrids can be assessed by applying appro- 
priate mathematical models [75-80]. 

These theoretical models for deriving estimates of 
relative pairing affinities between chromosomes of 
different genomes are based on several assumptions. 
The models of Kimber and Alonso [77], for example, 
make a number of stringent assumptions about the 
chromosomes and the pattern of genomic affinity: 

1 the long and short arms have equal chiasmate 
binding and have the same pattern of synapsis, 

2 there are only two levels of genomic affinity, 
linked, respectively, to the proportioning constants x 
for the closest relative affinity and y for the remain- 
ing relative affinities. 

Before applying a specific model, it is necessary to 
check if the assumptions of the model are met so that 
one does not arrive at erroneous conclusions. For
example, pronounced differences in chiasma
frequency between chromosome arms make the
models of Kimber and associates less suitable. 
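
One simple way to screen for such arm differences is a chi-square goodness-of-fit test of the chiasma counts on the two arms against a 1:1 expectation. The Python sketch below is a generic statistical check and not part of the procedure published by Kimber and associates. The function name and the counts are hypothetical, and the 5% significance level is an arbitrary choice.

```python
# Critical value of chi-square with 1 degree of freedom at the 5% level.
CRITICAL_5_PERCENT_1_DF = 3.841


def arm_equality_chi_square(short_arm_chiasmata, long_arm_chiasmata):
    """Chi-square statistic for a 1:1 split of chiasmata between arms."""
    total = short_arm_chiasmata + long_arm_chiasmata
    if total == 0:
        raise ValueError("no chiasmata scored")
    expected = total / 2
    return ((short_arm_chiasmata - expected) ** 2
            + (long_arm_chiasmata - expected) ** 2) / expected


# Invented counts of chiasmata scored on the short and long arms.
for short_count, long_count in [(48, 52), (40, 60)]:
    statistic = arm_equality_chi_square(short_count, long_count)
    differs = statistic > CRITICAL_5_PERCENT_1_DF
    print(short_count, long_count, round(statistic, 2),
          "arms differ" if differs else "no evidence of difference")
```

A significant result would suggest that the equal-arm assumption is violated and that estimates from such models should be treated with caution.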


32.9 Protein electrophoresis 


Alcohol-soluble prolamins constitute a major stor- 
age protein fraction in cereals, such as the gliadins in 
wheat, avenins in oats, zeins in maize, and hordeins 
in barley. Electrophoretic similarity of seed gliadin 
proteins, for example, provides a reliable measure 
of species affinity [81-83], because the banding 
patterns of these proteins are not influenced by the 
environment [84], although their quantity is easily 
affected. Konarev et al. [85] identified two major 
groups of gliadin proteins, based on their relative 
mobilities, and these proteins have been used to 
assess species relationships. We have found, for 
example, that Thinopyrum bessarabicum and Lophopy- 
rum elongatum can be identified on the basis of their 
distinct gliadin profiles. Seed-protein profiles have 
also been employed for assessing genome relation- 
ships between cultivated rice species and their wild 
progenitors [86]. 

The gliadins or other seed proteins may be limited 
in their ability to differentiate among genomes 
because only a small portion of the genome is 
analysed compared to the other techniques dis- 
cussed above. Another limitation of this tool is the 
availability of seed. Moreover, the protocol for 
gliadin electrophoresis has several variables and 
may not always give consistent results. 

Isoenzymes or isozymes—multiple molecular 
forms of enzymes—have been used to study phy- 
logeny, genomic relationships, and synteny relation- 
ships among loci in related genomes in several plant 
groups [87-89]. They have, for example, played a 
valuable role in determining homoeologous rela- 
tionships among the chromosomes of the tribe 
Triticeae. Although isozymes can serve as excellent 
markers for particular chromosomes, other mole- 
cular markers such as restriction fragment length 
polymorphisms (RFLPs), being essentially unlim- 
ited in number, offer more advantages as markers. 
Therefore, isozyme markers will probably have a 
limited role in comparative genome mapping in the 
future. 


32.10 Molecular tools of 
genome analysis 


In the past decade, major advances have been made 
in the molecular understanding of plant genomes. 
Molecular cytogenetic methods can help associate 
particular DNA sequences with particular sites on
the chromosomes, thus helping to understand
aspects of genome organization. Techniques are now 
available in which nucleic acid probes are used to 
locate specific nucleic acid sequences by in situ 
hybridization. Labelled DNA probes are hybridized 
to chromosomes after appropriate treatments, and 
the chromosomal regions that bind the probe are 
visualized (see, for example, Chapter 9). 

In situ hybridization has proved valuable in 
studying genome relationships. In genomic in situ 
hybridization (GISH), total genomic DNA is used as 
a probe to chromosome spreads [90] or Southern 
blots [91]. This technique permits identification of 
the genomic origin of chromosomes from different 
species in allopolyploids, synthetic hybrids, and 
alien addition, alien substitution and translocation 
lines in various species. The method gives useful 
information about the similarities of DNA from 
related species. It is sometimes referred to as 
chromosome painting (see Chapter 10) because the 
probe is labelled with a fluorescent dye and the 
chromosomes detected by the probe thus become 
uniformly coloured. GISH has been successfully 
employed to study genomic relationships in com- 
mon wheat, tobacco and other allopolyploid plant 
species and hybrids [91-93]. 

Fluorescence in situ hybridization (FISH) (Chapter 
9) is another powerful tool for chromosome 
mapping and for analysing genome relationships. 
FISH has been used to detect parental genomes in 
hybrids [94] and allopolyploids [95], and for detect- 
ing alien chromosome segments in translocations 
[96-98]. Multicolour FISH has also been used to 
simultaneously discriminate several genomes in 
allohexaploids such as wheat [99]. Using a com- 
bination of Giemsa banding and FISH, Jiang and Gill 
[100] found a 4A-5A-7B cyclic translocation specific 
to T. turgidum (2n=4x=28; AABB) and a different 
cyclic translocation in T. timopheevii (2n=4x=28;
AAGG), thereby supporting the diphyletic origin of 
tetraploid wheats. 

In situ hybridization can help detect some chro- 
mosomal landmarks (see also Section 32.3.2), which, 
in turn, can be used for phylogenetic investigations. 
The diploid T. monococcum is known to have two 
pairs of satellite chromosomes with NOR loci [101, 
102]. Recent in situ hybridization analysis using an
18S-26S rDNA probe detected a new NOR locus at 
the telomere of the long arm of chromosome 5A in T. 
monococcum ssp. boeoticum as well as in ssp. urartu, 
which is also present in chromosome 5A of T. 
turgidum and T. aestivum, and in 5At of T. timopheevii
[103]. 

Molecular data generated from isozyme analyses 
and oligonucleotide fingerprinting have been used
in genome analyses. Molecular marker-assisted
genome analysis also offers certain advantages. 
Thus, RFLPs [104] and random amplified poly- 
morphic DNAs (RAPDs) (refs 105 and 106; see also 
Chapter 5) have been successfully used for both 
genome analysis and mapping. The advent of 
molecular marker technology, using isozymes and 
RFLPs, has particularly revolutionized the linkage 
mapping strategies by providing easily mappable 
biochemical markers in addition to the classical 
morphological genes [107]. The molecular maps of 
several crop plants have been or are being con- 
structed and are already proving useful in plant 
breeding programs in terms of locating genes of 
economic interest, and tagging and tracking genes to 
facilitate their transfer to desirable cultivars. Map- 
based technology is also useful in studying genome 
evolution and in revealing unusual synteny rela- 
tionships among distantly related species and 
genera [108-111]. Thus, the RFLP map of the potato 
is almost identical in the order of markers to that
of tomato [108]. One particular problem with RFLP 
analysis is that most current methods used for plants 
require the use of radioisotopes and may not be cost- 
and time-effective. 

The RAPD-PCR technique (random amplification 
of polymorphic DNA by the polymerase chain 
reaction) provides a means of rapidly detecting 
polymorphisms for genetic mapping and strain 
identification [112,113]. The ability of RAPDs to 
survey numerous loci in the genome makes this 
technique particularly attractive for studying 
genetic distance and for phylogeny reconstruction. 
However, the method has several limitations [114]. 
The RAPD technique is useful in genetic analysis 
only if variation in banding patterns represents 
allelic segregation at independent loci. Polymor- 
phism is detected as band presence vs. absence and 
may be caused either by failure to prime a site 
because of nucleotide sequence differences or by 
insertions or deletions in the fragments between two 
conserved primer sites. Intermittent PCR artefacts 
may sometimes be misread as true allelic segre- 
gation [115]. 
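Because RAPD polymorphism is scored as band presence versus absence, comparing two accessions reduces to comparing two binary vectors. The sketch below is only an illustration of that idea: the band scores and line names are invented, and the Jaccard distance is one common choice for dominant markers, not necessarily the measure used in the studies cited above.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - (bands shared by both profiles / bands present in either)."""
    if len(a) != len(b):
        raise ValueError("profiles must score the same set of bands")
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - shared / either


# Hypothetical RAPD band scores (1 = band present, 0 = band absent).
profiles = {
    "line_A": [1, 1, 0, 1, 0, 1],
    "line_B": [1, 0, 0, 1, 1, 1],
    "line_C": [0, 1, 1, 0, 1, 0],
}

# Pairwise distances between all accessions.
distances = {
    (p, q): jaccard_distance(profiles[p], profiles[q])
    for p, q in combinations(profiles, 2)
}

for pair, d in distances.items():
    print(pair, round(d, 3))
```

Such distances are only meaningful if each band behaves as an allele at an independent locus, which is exactly the condition discussed above; artefactual bands would distort them.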

Other technical problems and some possible 
solutions have been outlined by Hadrys et al. [116]. A 
possibly serious problem with RAPD technology is 
its reproducibility, which would limit the com- 
parison of RAPD analysis data from one laboratory 
to another. DNA fragments that were amplified by 
five primers and shown to be reproducibly poly- 
morphic between two oat cultivars (at the Agri- 
culture Canada, Ottawa laboratory) were tested in 
six other laboratories in North America. Four of 
these participating laboratories amplified very few 
or no fragments using the Ottawa protocol, although
the same participants were able to generate a 
considerable number of amplified fragments using 
their own protocols [117]. Unless results obtained in 
one laboratory are reproducible both within and 
among laboratories, the potential benefits of RAPD 
technology will not be realized. This underscores the 
need for the use of uniform protocols when results 
are compared. 


32.11 Conclusion and perspectives 


Several techniques of classical cytogenetics and 
biochemistry have been successfully employed for 
studying genome relationships both within poly- 
ploid taxa and among various plant species. Each of 
these tools has its own merits and also certain 
problems associated with it. Nevertheless, major 
advances have been made in understanding genome 
relationships in numerous plant taxa, and this
knowledge has helped in planning germplasm 
enhancement strategies in several crop plants and 
other species of economic value. Molecular maps 
of several crop plants are being constructed (see 
Chapters 33 and 34) and are already having an 
impact on plant breeding in respect of locating, 
tagging, and tracking genes of economic value. 

The development of chromosome microdissection 
and microcloning techniques (see, for example, 
Chapter 11) has provided powerful tools for mole- 
cular analysis of the human genome [118,119]. Some 
plants, particularly those with small chromosome 
number but large chromosome size, such as rye (S. 
cereale L.) and barley (H. vulgare L.) may be amenable 
to microdissection. Like several other molecular 
genetic techniques that were first developed in 
human cytogenetic studies and have later been 
adopted for plants (e.g. flow cytometry, RFLPs, 
microsatellites, and FISH analysis), it is anticipated 
that chromosome microdissection and microcloning 
will soon find an application in plant genome 
analysis, particularly for fine structure physical 
mapping. This new technology may yet open an 
exciting new era in characterizing the molecular 
structure and organization of plant genomes. 

In genome analysis by classical cytogenetics, a 
genome is considered to be a static entity. However, 
a genome of a plant species can be dynamic, that is, 
it can undergo structural changes spontaneously. 
Genome-restructuring genes have also been report- 
ed [120] and chromosome structural changes in one 
species can be induced by an alien chromosome 
introduced into it [121]. Chromosome mutations 
frequently occur in lines of common wheat with the 
addition of certain chromosomes, called gametocidal
chromosomes, from its wild relative Aegilops
[122,123]. The study of restructured genomes thus 
created, or of reconstituted genomes produced 
through plant breeding, cannot be carried out using 
traditional cytogenetic tools. However, during the 
past decade, molecular cytogenetic techniques have 
been developed that have added new dimensions 
to the study of genome relationships. Both GISH 
and FISH techniques can be successfully employed 
to study restructured genomes, for elucidating 
evolutionary changes in genomes, and even for 
identification of parts of chromosomes involved in 
translocations. Although most of the techniques 
outlined above have contributed useful inform- 
ation on genome relationships, a multidisciplinary 
approach to genome analysis is always preferable 
for obtaining a full picture of genome relationships 
within and between species. 


33.1 Introduction


Plant species constitute a large proportion of living 
organisms, and all animals, including humans, are 
completely dependent on higher plants for their 
nutrition. As a consequence, improving crops has 
been a major activity for mankind since the very 
beginning of the human ‘adventure’. In most coun- 
tries, agriculture and plant breeding are among the 
principal economic resources and it is essential to 
improve our knowledge of plant genomes in order 
to progress in plant breeding. Towards this goal, 
many basic questions have to be addressed. How do 
we identify plant genes and determine how they are 
organized on the chromosomes? Which genes con- 
trol agronomic traits? How can we improve classical 
breeding using the information obtained from in- 
depth studies of plant genomes? How is recombina- 
tion capacity controlled? How can we reintroduce a 
new gene into a plant genome and also target its 
localization and expression? 

Independently of these practical goals, higher 
plants have a number of specific properties and pro- 
cesses which suggest that novel genes and biological 
regulatory mechanisms could be uncovered by 
analysing their genomes. Examples of such pro- 
cesses are the structure and assembly of the cellulose 
cell wall, photosynthesis, nitrogen and carbohydrate 
assimilation, production of secondary metabolic 
products, perception of light, morphogenetic re- 
sponses and response to environmental stress. 

When trying to answer the questions posed 
above, plant breeders and plant physiologists have 
been faced with two problems. The first is the large 
number of crops that have to be studied for 
economic reasons. The second is the wide range of 
complexity of plant genomes, which range over two 
orders of magnitude [1-3] from a few hundred 
megabases up to 30 000 Mb (Table 33.1). To overcome 
these difficulties, plant scientists have focused their 
genome-sequencing projects on two species with 
reasonably sized genomes: a dicotyledon, Arabi- 
dopsis thaliana, and a monocotyledon, rice (Oryza 
sativa). These two plants have rather different habits 
and morphology. They also have very distinct codon 
usages. In addition to these two species, genomes 
from more than 40 economically important crops 
have been mapped by RFLP. This exceptionally 
favourable situation results essentially from the 
facility with which plants can be crossed and their 
progenies analysed. 

Table 33.1 Comparison of genome size in plant species with other genomes analysed in sequencing programmes.

Organism                     Size (Mbp)
Saccharomyces cerevisiae     15
Caenorhabditis elegans       100
Homo sapiens                 3500
Arabidopsis thaliana         145
Oryza sativa                 440
Brassica oleracea            600
Lycopersicon esculentum      1000
Zea mays                     2500
Hordeum vulgare              4800
Triticum aestivum            16 000
Tulipa officinalis           30 000


Arabidopsis thaliana (or thale cress) is a small weed
(see Plate 10) which is being used as a model plant
species by a large part of the scientific community. It
belongs to the family Cruciferae, which contains
several major crops such as oilseed rape, mustard
and cabbage. Arabidopsis has several key advantages 
over other plant species for genome analysis. The 
first is that its genome is one of the smallest in 
flowering plants. A second advantage is its short 
generation time (no more than two months in 
optimal conditions from seed to seeds). It is self- 
fertile and its progeny is abundant (more than 1000 
seeds can be harvested from a single plant and a 
single fruit (silique) contains about 30 seeds). Many 
different ecotypes with distinct characters have been 
selected as sources of genetic variability. This set of 
features allows detailed genetic analysis, so that it is 
possible to combine genetic mapping of visible 
markers with molecular marker mapping. Because 
Arabidopsis seeds are rather small (1000 seeds weigh 
20-22 mg), they are easy to mutagenize and a large 
number of mutants obtained by different methods 
have been described. Finally, Arabidopsis can be 
easily transformed using agrobacteria, opening the 
way to reverse genetic analysis for characterizing 
genes. Several reviews on Arabidopsis are available 
[4-9] and two textbooks of methods for Arabidopsis 
research have recently been published [10,11]. 

In this review, we shall describe the present status 
of our knowledge as well as the strategies which are 
being used to characterize the genome of this plant 
at increasingly greater resolution. Some of these 
strategies are specific to the analysis of plant gen- 
omes while others are common to other genomes. In 
addition, we shall discuss some of the implications 
of studies in Arabidopsis for finding human genes 
and analysing their function. 


33.2 Size of the genome and general organization


Arabidopsis thaliana, like all other higher plant 
species, has three distinct genomes. The chloroplast
genome is ~150 kilobase pairs (kbp) and is very 
similar to those of the other plants in which it has 
been completely sequenced (tobacco, rice and pine) 
[12]. The mitochondrial genome is ~370 kbp and is 
one of the simplest in plants. More than 75% of its 
sequence has now been determined [13]. In leaves, 
the chloroplast genome can account for as much as 
27% of total DNA, whereas the mitochondrial 
genome usually represents less than 2% [12,14]. 

The Arabidopsis nuclear genome is organized in 
five chromosome pairs which are very small (on 
average, 2 µm) and are almost indistinguishable
from each other by optical microscopy of metaphase 
plates [15-17]. However, preparation of synaptone- 
mal complex complements for electron microscopy 
has recently allowed the analysis of Arabidopsis
chromosomes at high resolution [18]. Three types of data
provide an estimation of the size of this genome. All 
these approaches have been used to estimate the size 
of other genomes and are not specific to plants. On 
the basis of DNA renaturation kinetics, the haploid 
genome size was initially estimated to be 70 
megabase pairs (Mbp) [14]. Because the size of the 
Escherichia coli genome, which was used as a stan- 
dard in these experiments, has been re-evaluated, 
the actual size is probably close to 100 Mbp. More 
recently, Goodman and coworkers fingerprinted 
≈17 000 cosmids which could be grouped into 750
contigs covering ≈91 Mbp [19]. Finally, extensive
cytometry measurements have been made on a large
number of plant species using highly standardized 
conditions. The estimate for Arabidopsis was 145 Mbp 
[3]. Although this size might be overestimated, the 
actual genome size is probably slightly higher than 
100 Mbp, which is comparable to that of other model 
organisms such as Drosophila melanogaster or Caeno- 
rhabditis elegans (see Table 33.1). This value has now 
been confirmed by physical mapping of individual 
chromosomes [20,21]. 

The initial DNA renaturation kinetics indicated 
that the Arabidopsis genome contained relatively 
few repeated sequences [14]. Highly repeated and 
foldback sequences comprise no more than 10% of 
the genome and another 10% is made of moderately 
repeated sequences. This gives Arabidopsis a con- 
siderable advantage as a model over other plants 
because these percentages are usually much higher: 
for instance, more than 80% of wheat DNA is made 
of repeated sequences [1,2]. Several tandemly 
repeated DNA families have been identified, cloned 
and sequenced. The genes encoding the 25S and 18S
cytoplasmic rRNA are organized as 10 kbp tandem 
repeats. With ~570 copies, they belong to the class 
of moderately repeated DNA. They exhibit some
heterogeneity and were estimated to represent 7-8%
of the DNA [14,22]. They have been located at two 
loci on the short arm of chromosome 4 and on the 
distal part of the shortest chromosome, chromosome 
2 [16-18,23]. There are about 1000 copies of the 5S 
rRNA genes [24], with at least one locus on chromo- 
some 4 [20]. Three other families of tandemly 
repeated DNA sequences have been described. They 
are members of the highly repetitive DNA 
sequences. The first comprises 4000 to 6000 copies of
a 180-bp unit which has been located essentially in 
the centromeric regions [25,26]. It accounts for about 
1% of the genome. The two other types of repeats are 
made up of a 500-bp and a 160-bp unit, respectively 
[26]. Each constitutes around 0.3% of the genome. 
The 500-bp repeat is related to the 180-bp satellite 
DNA, but the 160-bp repeat corresponds to a com- 
pletely different sequence. The last tandemly 
organized repeat to be described is the telomeric 
sequence which is made of around 350 copies of a 7- 
bp motif, CCCTAAA, as in many organisms. This 
block is present at each end of each individual 
chromosome [27]. Similar sequences have also been 
located in the chromosome 1 centromeric region 
[28]. In contrast to the satellite DNA sequences, 
which are species specific, the telomere sequences 
are highly conserved in all plant species studied so 
far. 
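The copy numbers and unit lengths quoted above can be turned into rough genome fractions. The following sketch is a back-of-the-envelope check rather than a measurement: it assumes the two haploid genome sizes discussed in this section (the initial 70 Mbp estimate and the revised value of about 100 Mbp), takes the upper copy number for the 180-bp satellite, and the helper name `percent_of_genome` is purely illustrative.

```python
def percent_of_genome(copies, unit_bp, genome_bp):
    """Percentage of the genome occupied by `copies` tandem units of `unit_bp`."""
    return 100.0 * copies * unit_bp / genome_bp


GENOME_INITIAL_BP = 70e6   # initial renaturation-kinetics estimate
GENOME_REVISED_BP = 100e6  # revised estimate (assumed here as about 100 Mbp)

# rDNA: about 570 copies of a 10-kbp repeat.
rdna_initial = percent_of_genome(570, 10_000, GENOME_INITIAL_BP)
rdna_revised = percent_of_genome(570, 10_000, GENOME_REVISED_BP)

# 180-bp centromeric satellite, using the upper copy number of 6000.
sat180_max = percent_of_genome(6000, 180, GENOME_REVISED_BP)

# Telomeric block: about 350 copies of the 7-bp motif at each chromosome end.
telomere_block_bp = 350 * 7

print(f"rDNA share at 70 Mbp:  {rdna_initial:.1f}%")
print(f"rDNA share at 100 Mbp: {rdna_revised:.1f}%")
print(f"180-bp satellite (max): {sat180_max:.2f}%")
print(f"telomeric block per chromosome end: {telomere_block_bp} bp")
```

Under these assumptions the rDNA share comes out near 8% for a 70 Mbp genome, in line with the 7-8% reported from the early renaturation data, and would be closer to 6% for a 100 Mbp genome.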

The renaturation kinetics data also tell us that 
there is relatively little interspersion of repeated 
sequences with single-copy sequences [14], an 
observation which has been confirmed by physical 
mapping of chromosome 4 [20]. The most interest- 
ing type of dispersed repeated sequences are the 
transposable elements. Several subclasses of retro- 
transposable elements have been described by 
Ausubel’s group, such as the Ta1 element [29,30].
Another element, Tat1, with the properties of a 
transposon, has been isolated and detected as a 
linear extrachromosomal molecule [31]. Neither of 
them seems to be active and they are usually present
in rather low copy number. Recently, an active 
element similar to the maize Ac transposon [32] as 
well as a new retrotransposon, Athila [33], have been
discovered. Athila, which is 10.5 kbp long, is similar
to the Ulysses element from Drosophila virilis. There
are as many as 30 copies in the genome. The 
remainder of the sequence is essentially single-copy 
or low-copy number sequence. The next steps in the 
description of the Arabidopsis genome organization 
are the achievement of genetic and physical maps 
and the identification of the genes. 

The Arabidopsis genome also contains simple 
repeat sequences such as microsatellites [34,35] (see 
Chapter 5). Minisatellites have been described 
recently [36]. In addition, it has recently been shown
that the telomeric sequence is also dispersed along 
the chromosomes and statistically over-represented 
in the 5’ noncoding regions of the genes [37]. 


33.3 Arabidopsis genetic maps 


The first Arabidopsis genetic map to be produced was 
based on morphological markers. It resulted from 
many crosses between lines carrying mutations, and 
from the analysis of trisomic lines [38]. Now, more 
than 800 mutants have been discovered following 
saturation mutagenesis with ethyl methanesul- 
phonate (EMS), radiation and other mutagens [39]. 
For many mutants, several alleles have been 
isolated. These mutants include a wide variety of 
phenotypes corresponding to all aspects of develop- 
ment [39], responses to environmental biotic or 
abiotic factors, and metabolic pathways. Particu- 
larly impressive are the large number of mutants 
affected in embryo development [40,41,197], flower 
formation and setting [42], hormone sensitivity or
deficiency and response to light [43,44] or in
response to pathogens [45-47], most of these pro- 
cesses being largely specific to plants. More than 
300 mutations have now been mapped and the most 
recent information can be obtained from AtDB (an 
Arabidopsis thaliana database) [48]. A list of these 
mutations has recently been published as a progress 
report from the National Science Foundation [49]. Many
of them are available from the Ohio State University 
and Nottingham University Stock Centres and can 
be ordered by electronic mail or WWW (see Section
33.6).

Several RFLP maps have been published [50,51] 
based on F2 segregation analysis and using essentially
anonymous genomic probes (λ-phage or
cosmid DNA). An attempt has recently been made 
to integrate them using common markers [52] and 
the JOINMAP program [53]. An example of the 
integrated map is shown in Fig. 33.1 as obtained
from the AtDB. The major drawback of F2 mapping 
populations of annual species is that they are not 
permanent and cannot be distributed. More recently,
new permanent mapping populations, made of
recombinant inbred lines, have been produced by
two groups [54,55]. This strategy is specific to the
plant kingdom, and its first application to genome
mapping was developed in maize [56].

Fig. 33.1 A view of the integrated genetic map of
Arabidopsis. The map of chromosome 5 has been enlarged
to show the details. The right-hand part of each panel
shows the approximate position of mutations (e.g. hy5,

Recombinant inbred lines are obtained by selfing 
individuals from an F2 progeny for several 
generations until the F7 or F8. Thus, the initial F2 
segregation is fixed and each line is homogeneous 
and homozygous for the alleles that segregated in 
F2. Such populations can be used for ever, since each 
line is simply maintained by selfing. They can be 
distributed worldwide, thus allowing dispersed 
laboratories to contribute to the saturation of the 
maps. The population developed in the United 
States by the Dupont de Nemours company results 
from a cross between line W100F and the Wassilewskija
ecotype and consists of 150 lines [54]. That devel- 
oped in the UK was made by crossing ecotype 
Landsberg erecta (La-er) with ecotype Columbia 
(Col) [55]. Both populations are available from the 
stock centres and the mapping data are maintained 
and updated by the authors of the map. By now, 
more than 400 markers have been located on the 
La-er x Col map with 64 RFLP markers in common 
with the two original maps. Data concerning 
these markers are available from AtDB. All these 
molecular genetic maps gave similar estimates of the 
genetic complexity of the Arabidopsis genome of
≈600 cM. One centiMorgan therefore corresponds, on
average, to only about 170 kbp.
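The reason why lines selfed to the F7 or F8 can be treated as fixed follows from a simple recurrence: under strict selfing, the expected proportion of heterozygous loci halves at each generation. The sketch below assumes two fully homozygous parents (so that the F1 is heterozygous at every segregating locus) and no selection; the function name is illustrative only.

```python
def expected_heterozygosity(generation):
    """Expected fraction of segregating loci still heterozygous in F<generation>.

    Assumes strict selfing from an F1 that is heterozygous at every
    segregating locus, with no selection.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or later")
    return 0.5 ** (generation - 1)


for g in range(2, 9):
    print(f"F{g}: {100 * expected_heterozygosity(g):.2f}% heterozygous")
```

By the F8, the expected residual heterozygosity under these assumptions is below 1%, which is why each line can be maintained by selfing and rescored for new markers without repeating the cross.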

The trend for the future is to map expressed 
sequence tags (ESTs), which are coming out of the 
systematic and partial cDNA sequencing pro- 
grammes in France [57,58] and in the USA [59] 
as well as new types of markers such as amplified 
fragment length polymorphisms (AFLPs) [60], 
microsatellites [35,36] or cleaved amplified 
polymorphic sequences (CAPS) [61,62]. AFLPs are 
obtained by digesting DNA with two restriction 
enzymes and adding adaptors; the next step consists 
in amplifying a subset of these fragments using 
primers overlapping the adaptors and extending 
two or three nucleotides (arbitrarily chosen) beyond 
the restriction site. The amplified fragments are then 
resolved on a sequencing gel and those which are 
polymorphic between the parental lines can be 
scored in the segregating population. The advantage 
of this fingerprinting technique is that several 
markers can be mapped at the same time. 
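The AFLP principle described above can be mimicked in silico. The following is a deliberately simplified sketch, not the published protocol: the two enzymes (EcoRI and MseI) are an assumption, since the text does not name them; adaptors are not modelled; selective bases are given in top-strand orientation at both ends; and the toy sequences are invented. A fragment is "amplified" only if it lies between sites of the two different enzymes and its first and last internal bases match the selective nucleotides.

```python
import re

# Assumed enzyme pair, for illustration only.
SITES = {"EcoRI": "GAATTC", "MseI": "TTAA"}


def find_sites(seq):
    """Return sorted (start, end, enzyme) tuples for every recognition site."""
    hits = []
    for enzyme, site in SITES.items():
        # Zero-width lookahead reports every occurrence by its start position.
        for m in re.finditer(f"(?={site})", seq):
            hits.append((m.start(), m.start() + len(site), enzyme))
    return sorted(hits)


def selected_fragments(seq, left_sel, right_sel):
    """Lengths of inserts between adjacent sites of different enzymes whose
    flanking internal bases match the selective nucleotides."""
    sites = find_sites(seq)
    lengths = []
    for (_, end1, enz1), (start2, _, enz2) in zip(sites, sites[1:]):
        if enz1 == enz2:
            continue  # both ends would carry the same adaptor
        insert = seq[end1:start2]
        if len(insert) < len(left_sel) + len(right_sel):
            continue
        if insert.startswith(left_sel) and insert.endswith(right_sel):
            lengths.append(len(insert))
    return lengths


# Two hypothetical parental sequences differing by one base next to the
# first EcoRI site (A in parent_a, C in parent_b).
parent_a = "GAATTC" + "ACGTACGTAG" + "TTAA" + "CCGGA" + "GAATTC"
parent_b = "GAATTC" + "CCGTACGTAG" + "TTAA" + "CCGGA" + "GAATTC"

bands_a = selected_fragments(parent_a, "A", "G")
bands_b = selected_fragments(parent_b, "A", "G")
print("parent A bands:", bands_a)
print("parent B bands:", bands_b)
```

In this toy case a single base change at a selective position turns a band present in one parent into a band absent in the other, which is the kind of presence/absence polymorphism that is then scored in the segregating population.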

More than 150 AFLP markers have already been 
mapped using La-er x Col recombinant inbred lines 
[21,60]. Thirty polymorphic microsatellite loci have 
been mapped by Ecker’s group on the same lines 
[35] and corresponding primers can be purchased
from Research Genetics Inc. (see Appendix III for
address). A set of 18 CAPS has also been developed 
and is available from the same company [61]. Their 
map positions have been determined [62]. They are 
evenly dispersed on the genome and were derived 
from the sequences of mapped genes. This set of 
markers now allows one to map unambiguously any 
gene to one of the 10 chromosome arms in a single 
cross using a limited number of F2 progeny. 
Additional polymerase chain reaction (PCR)-based 
markers of this type should become available soon. 
The expectation is that within the next two years 
more than 1200 mapped molecular markers will be 
available with an average density of one marker
every 0.5 cM or every 85 kbp, which should transform
tedious chromosome walking strategies into a
straightforward ‘chromosome landing’ approach to 
isolate genes identified by a mutant phenotype. 
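The figures quoted in this section follow from simple arithmetic on the map length and the genome size. The sketch below assumes a haploid genome of about 100 Mbp and a genetic map of about 600 cM, as given in the text; the variable and function names are illustrative.

```python
GENOME_KBP = 100_000   # assumed haploid genome size (about 100 Mbp)
MAP_LENGTH_CM = 600    # approximate total length of the genetic map

# Average physical distance per unit of genetic distance.
kbp_per_cm = GENOME_KBP / MAP_LENGTH_CM


def marker_spacing(n_markers, map_cm=MAP_LENGTH_CM, kbp_per_cm=170):
    """Average spacing between evenly spread markers, in cM and in kbp.

    The default of 170 kbp per cM is the rounded value used in the text.
    """
    spacing_cm = map_cm / n_markers
    return spacing_cm, spacing_cm * kbp_per_cm


spacing_cm, spacing_kbp = marker_spacing(1200)
print(f"{kbp_per_cm:.0f} kbp per cM on average")
print(f"1200 markers: one every {spacing_cm} cM, about {spacing_kbp:.0f} kbp")
```

These are genome-wide averages; the ratio of physical to genetic distance varies along the chromosomes, so the spacing around any particular locus may differ considerably.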


33.4 Physical map 


In order to be able to isolate genes corresponding to 
mapped mutations, it is necessary to establish phy- 
sical maps. The first attempt to organize a physical 
map of the Arabidopsis genome was by Howard 
Goodman’s group at Massachusetts General Hospi- 
tal (MGH), Boston. They ordered a cosmid library 
according to the strategy used for Caenorhabditis 
elegans (see Chapter 29). Almost 20 000 cosmids were 
fingerprinted following labelling of HindIII sites, 
further digestion with Sau3A and resolution of 
labelled DNA fragments by high-resolution poly- 
acrylamide gel electrophoresis. Overlapping cosmids 
were organized into contigs by image and computer 
analysis, allowing the organization of 750 contigs 
[19]. Many of these data are also publicly available 
through AtDB, and a series of cosmid clones are 
distributed by the Ohio Stock Center. Filling the 
gaps between these contigs would have required 
enormous additional work and another strategy 
was preferred, based on ordering YAC libraries. 
Three publicly available YAC libraries have been 
produced. They have relatively small inserts, in the 
range of 160 kbp [63-65]. They contain, respectively, 
2100, 2700, and 2200 clones and collectively repre- 
sent 10 genome equivalents. Two other libraries with 
larger inserts have been made [58,65] and several 
others are being developed. The library prepared by 
a collaboration between several French plant 
scientists and colleagues at the Centre d’Etude du 
Polymorphisme Humain, Paris (CEPH) shows an 
average insert of 420 kbp, which is much more
convenient for chromosome walking [58]. All these 
libraries have been made using pYAC4 or deriva- 
tives as vectors. The major difficulty in preparing 
YAC libraries from Arabidopsis is the DNA concentration:
there is 35 times less DNA in nuclei from this
species than in human cells, and nuclei have to be 
prepared from protoplasts, which constitutes a 
serious limiting factor. The three publicly available 
libraries have been partially screened with 125 RFLP 
probes by a small group of laboratories in England 
and the United States. As a result, a first set of 296 
YACs covering ~30% of the genome was described 
[66]. In addition to this large-scale coverage of the 
genome, several groups are organizing contigs 
around their favourite locus [67-71]. 
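The depth of the three public YAC libraries can be estimated from the clone numbers and insert size given above. The sketch below assumes a haploid genome of about 100 Mbp and a uniform insert of 160 kbp, and uses the standard random-library approximation usually attributed to Clarke and Carbon, P = 1 - (1 - f)^N, where f is the fraction of the genome carried by one clone. It ignores chimaeric clones and uneven representation, so the numbers are indicative only.

```python
import math


def genome_equivalents(clone_counts, insert_kbp, genome_kbp):
    """Total cloned DNA expressed as multiples of the haploid genome."""
    return sum(clone_counts) * insert_kbp / genome_kbp


def probability_of_coverage(n_clones, insert_kbp, genome_kbp):
    """Chance that a given locus lies in at least one of n random clones."""
    fraction = insert_kbp / genome_kbp
    return 1.0 - (1.0 - fraction) ** n_clones


def clones_needed(probability, insert_kbp, genome_kbp):
    """Smallest number of random clones reaching the requested probability."""
    fraction = insert_kbp / genome_kbp
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


GENOME_KBP = 100_000  # assumed haploid genome size (about 100 Mbp)

equivalents = genome_equivalents([2100, 2700, 2200], 160, GENOME_KBP)
p_locus = probability_of_coverage(7000, 160, GENOME_KBP)
n_small = clones_needed(0.99, 160, GENOME_KBP)  # 160-kbp inserts
n_large = clones_needed(0.99, 420, GENOME_KBP)  # 420-kbp inserts

print(f"genome equivalents: {equivalents:.1f}")
print(f"P(locus cloned at least once): {p_locus:.5f}")
print(f"clones for 99% coverage: {n_small} (160 kbp) vs {n_large} (420 kbp)")
```

With these assumptions the three libraries together hold roughly eleven genome equivalents, consistent with the figure of about ten quoted above, and the calculation shows how sharply larger inserts reduce the number of clones that must be screened.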

Most of the European effort has dealt with 
chromosomes 4 and 5. By the end of 1993 more than 
85% of chromosome 4 was covered by 35 YAC 
contigs; due to the recent use of the new YAC library 
this number has now (early 1996) decreased to three 
and coverage outside of the centromere and 
ribosomal gene regions is now complete [20]. 
Chromosome 2 is covered by six overlapping 
contigs [72]. Two large contigs have been organized 
around the genes controlling late flowering: fca on
chromosome 4 and Co on chromosome 5. The first is
2200 kbp long and covers ~19 cM, whereas the
second covers 2700 kbp [68,69]. The strategy for
organizing these contigs has been to generate end- 
specific probes from YACs identified by colony 
hybridization with an RFLP probe, using either 
plasmid rescue or inverse PCR. However, this 
method is limited both by the presence of repeated 
sequences and by the occurrence of many chimaeric 
YACs. Therefore it is of utmost importance to control 
each step of the walk by generating a YAC-derived 
RFLP probe and demonstrating linkage with the 
initial RFLP marker. The YAC clones can also be 
used as probes for analysing genomic DNA 
separated by pulsed-field gel electrophoresis.

Several strategies are being developed to fill the 
gaps between the contigs. The first is to align the 
cosmid and YAC contigs. The second is to find more 
RFLP markers that identify new YACs. Finally, 
probes derived from the ends of the present YAC 
contigs are being used to identify overlapping YAC 
clones in the new libraries with larger inserts and 
less chimaerism. The combination of these strategies 
should lead in the very near future to a complete 
physical map of all five Arabidopsis chromosomes, 
which will considerably facilitate the sequencing of 
a complete plant chromosome and the isolation of 
any gene on this chromosome [21]. 

A library has also been made in a P1 phage vector 
[73] and, more recently, as bacterial artificial chro- 
mosomes (BACs) [74]. 


33.5 Strategies for gene identification 


33.5.1 Classical strategies 


The primary strategy to identify genes in plants has 
been the classical one, consisting of purifying the 
corresponding proteins, obtaining antibodies or 
amino acid sequence information and using these 
tools to isolate the corresponding cDNA and geno- 
mic clones. Many genes have also been picked up 
following differential screening of cDNA libraries, 
sequence determination and comparison with 
known genes in databases. Some examples illus- 
trating this strategy are given in refs 75-79. This is 
the situation for many developmentally regulated 
genes or for genes responsive to environmental 
changes. In many cases, no well-defined function 
can be identified. The major limitation on this 
strategy is that the probability of isolating a gene 
that is not abundantly expressed is rather low. This 
is indeed the situation with most regulatory genes 
and most genes of agronomic interest, such as those 
controlling plant morphology, time of flowering or 
resistance to diseases. Many of these genes could not 
have been isolated by classical techniques. 

Another relatively classic strategy consists of
complementing E. coli or Saccharomyces cerevisiae
mutants [80] or mammalian cells (see Table 33.2).

Complementation of a known mutant is also a 
useful tool for confirming the identification of a 
clone on the basis of sequence homology. Its 
usefulness is, however, limited by the availability of 
a mutant and by the divergence between yeast, 
bacteria and plant genes. Functional complementa- 
tion of mammalian cells has also recently been used 
to isolate plant genes coding for plasma membrane 
proteins [101] and an apoptosis suppressor gene
[102] (see Chapter 18 for general techniques in this 
area). Functional assays in Xenopus oocytes are also 
used to analyse transporter and ion channel genes. 
Finally, one can make use of genomic subtraction 
methods if deletion mutants are available. This 
strategy was illustrated with cloning of GA1, a gene 
involved in gibberellic acid biosynthesis [103]. 


Table 33.2 Examples of Arabidopsis genes cloned using a complementation strategy.

In Escherichia coli
  RecA homologues [81]
  UvrB, UvrC homologues [82,83]
  5’-Phosphoribosyl-5-aminoimidazole synthetase [84,85]
  Glycinamide ribonucleotide (GAR) synthetase [85]
  GAR-transformylase [85]
  3-Methyladenine DNA glycosylase [86]
  cnx1 (Molybdenum cofactor biosynthesis) [162]
  Phosphoribosylanthranilate isomerase [165]

In yeast
  Potassium transporters [87,88]
  Glucose and sucrose transporters [89,90]
  Amino acid and peptide transporters [91-93]
  NTR1, nitrate transporter related [94]
  NH4+ high-affinity transporter [95]
  Cdc2-P34 protein kinase [96]
  Chorismate mutase [97]
  Cycloartenol synthase [98]
  ATP-sulphurylase [99]
  Aspartate transcarbamylase [100]
  Orotate phosphoribosyltransferase/orotidine 5’-phosphate decarboxylase [100]
  CDC48 [166]
  Mevalonate kinase [167]
  Sar1, Sec12 [168]
  Sec14 (M. Lepetit, personal communication 1995)
  ERD2 [169]
  AAT1 [170]

In mammalian cells
  Plasma membrane integral proteins [101]


33.5.2 Genetic approaches: map-based cloning, T-DNA and transposon tagging


Owing to the small size of the Arabidopsis genome
and the ease with which it can be transformed, two
major genetic strategies have been developed. They
are partially specific to plants with small genomes.
The first is a map-based cloning approach. The
gene of interest is identified by a mutation that can
be located on the genetic map and flanked by several
molecular markers. When the genetic distance between
flanking markers is reasonably small, a
chromosome walk can be attempted. Once a contig 
spanning the mutation is organized, the region 
around the gene has to be narrowed down. This can 
be done in several ways: a new RFLP which 
cosegregates with the mutation can be found; the 
region covering the putatively identified gene can be 
transferred back to the mutant line by transforma- 
tion, which should lead to complementation of the 
mutation in the progeny of the transgenic plants; and 
finally, a cDNA detecting the expected expression 
pattern can be isolated and used to correct the 
mutation in transgenic plants. The first success of 
this strategy was the cloning of the ABI3 locus [104] 
controlling sensitivity to abscisic acid, rapidly 
followed by that of the FAD3 locus [105] which 
corresponds to a C18:2 fatty acid desaturase. Now 
more than thirty important genes have been isolated 
from Arabidopsis following this approach (Table 
33.3). It is obvious that the more the genetic map is 
saturated and the physical map complete and 
reliable, the easier this approach becomes. 

The second genetic approach is based on tagging 
the gene of interest with a piece of identified DNA. 
The plant biologist can use two tools: transposable 
elements as in Drosophila, or T-DNA derived from 
the Ti plasmid of Agrobacterium tumefaciens, the 
bacterium used for transferring foreign genes to 
plants. Most success has been achieved so far with 
the T-DNA approach pioneered by K. Feldmann and 
C. Koncz. The former developed a seed transforma- 
tion protocol which avoids plant regeneration from 
tissue culture and the associated somaclonal 
variation. The latter used more classical techniques 
for transformation of leaf discs and root cells. 
Although the frequency of transformed plants was 
low, it was manageable, and several thousand 
individual lines were regenerated and screened for a 
variety of mutants [106-109]. Tagging should be 
established by selfing the mutant and analysing its 
progeny both for the selectable marker (usually 
resistance to kanamycin or bialaphos) and the 
presence of the mutation. To be sure that the 
mutation is tagged, no sensitive plant should be 
found amongst the mutants when 200-300 plants are 
analysed. Sequences flanking the T-DNA insertion 
can be recovered by plasmid rescue or inverse PCR 
and can be used as probes to isolate the wild-type 
allele of the gene that has been disrupted by the 
insertion. When the gene has been isolated, it should 
be demonstrated by transformation that it can 
correct the mutation. 

During the last few years, several important 
Arabidopsis genes have been isolated using this 
strategy and several others will appear soon (Table 
33.4). This technique, although very powerful, has 
several drawbacks: the seed transformation method 
is poorly reproducible; there are often several T- 
DNA insertions in the same plant; and, although 
somaclonal variation is reduced, many mutants are 
not tagged by the T-DNA. The more classical 
transformation method has the same limitations, 
and in addition is more time consuming. Recently, 
another transformation technique [110,111], in 
which the whole plant is vacuum-infiltrated with 
Agrobacterium, has allowed the production of several 
thousand additional transgenic plants transformed 
with an improved vector. It was calculated that 
50000 such independent lines would be enough to 
saturate the genome with well-defined insertions 
and pick up virtually any gene for which a mutant 
phenotype can be observed. 
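
The figure of 50000 lines can be related to a simple model. The sketch below assumes that insertions land uniformly at random in a genome of about 100 Mbp and that a gene presents a target of a few kbp. Both the uniformity assumption and the target sizes are illustrative choices, not the inputs of the calculation cited above, and the function names are hypothetical.

```python
import math


def hit_probability(n_lines, target_bp, genome_bp=100e6):
    """Probability that at least one of n_lines independent, uniformly
    distributed insertions falls inside a target of target_bp base pairs."""
    p_single = target_bp / genome_bp
    return 1.0 - (1.0 - p_single) ** n_lines


def lines_needed(prob, target_bp, genome_bp=100e6):
    """Number of independent insertions needed to hit the target with
    probability prob under the same uniform model."""
    p_single = target_bp / genome_bp
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - p_single))


if __name__ == "__main__":
    # Hypothetical target sizes in bp; real genes vary widely.
    for target in (1000, 2500, 5000):
        print(target,
              round(hit_probability(50000, target), 3),
              lines_needed(0.95, target))
```

Under these assumptions a 5 kbp target is hit at least once with a probability of roughly 0.92 in 50000 single-insertion lines; smaller targets are hit less often, and lines carrying several insertions would reduce the number of plants required.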

Such a collection of mutants will also be useful in 
determining the function of genes with no known 
function. In yeast, the strategy used to determine the 

Table 33.3 List of genes for which a map-based cloning 
strategy is used. Most of the genes in the lower part of this 
table have now been isolated, as well as several others. 

Gene            Phenotype or enzyme affected                       Reference 

Genes isolated by map-based cloning 

ABI1            Vegetative tissues, abscisic acid insensitivity    [70,71] 
ABI3            Abscisic acid insensitivity in seed                [104] 
AXR1            Auxin resistance                                   [171] 
CONSTANS        Flowering gene                                     [172] 
DET1 (FUSCA)    De-etiolated                                       [173] 
ETR1            Ethylene resistance                                [174] 
FAD3            C18:2 desaturase                                   [105] 
LEAFY           Inflorescence determination                        [73] 
PISTILLATA      Abnormal flower                                    [175] 
RPM1            Disease resistance                                 [176] 
RPS2            Disease resistance (P. syringae race avr Rpt2)     [47] 

Other genes for which chromosome walking is in progress (cited in ref. 9) 

ABI2            Abscisic acid insensitivity 
ARA1            Arabinose sensitivity 
AXR2            Auxin resistance 
DET2            De-etiolated 
EIN2            Ethylene insensitivity 
EIN3            Ethylene insensitivity 
FCA             Late flowering 
FRI             Flowering gene 
FWA             Late flowering 
GA2             Gibberellic acid insensitivity 
GAI             Gibberellic acid insensitivity 
GI              Late flowering 
GNOM            Embryo pattern 
MS1             Male sterility 
PHO1            Phosphate translocator 
RPP5            Disease resistance (Peronospora) 
TFL             Terminal flower 
TTG             Transparent testa glabrous 

function of a new gene is gene disruption via site- 
directed homologous recombination [112]. This 
method can also be used to some extent with animal 
cells but is not yet working with plants. An 
alternative would be to screen DNA pools of the 
collection of transgenic lines by PCR, to identify the 
line which is interrupted in the new gene and search 
for abnormal phenotypes. The first results using this 
strategy have been reported recently [113]. Most of 
the transgenic lines that have been reasonably well 
analysed are available from the stock centres for 
further screening. 

The transposon tagging strategy is also very 
powerful, as demonstrated in maize, Antirrhinum 
and Drosophila. However, the problem with 
Arabidopsis was that, until recently [33], no active 
transposon had been identified in this species. This 
inconvenience could be overcome by introducing by 
T-DNA transformation a defective nonautonomous 
maize transposable element such as Ac/Ds or 
Enhancer/Inhibitor into one set of transgenic lines 


and an active transposase into another set of lines. 
When the two types of lines are crossed, the 
defective element is able to jump to another place. 
Because the transposase gene and the defective 
element are not linked, in most cases they would 
segregate in the progeny, and stable mutants, tagged 
with the defective nonautonomous element, should 
appear. The advantage of this method over the T- 
DNA tagging approach is that the defective element 
can be moved again by crossing with a line carrying 
the transposase gene, in which case revertants 
should be obtained. In addition, in the revertant, a 
footprint of the transposition event should be 
observed. Although several genes have been tagged 
in various laboratories, so far there have been very 
few reports of isolation of an Arabidopsis gene using 
this method [114-116]. The same proofs for tagging 
as in the case of a T-DNA tag should be obtained: 
mutation and insertion should cosegregate in the 
progeny when selfing the mutant; the mutation 
should be corrected by the wild type allele when 

Table 33.4 List of genes isolated using T-DNA tagging. 

Gene              Phenotype or enzyme affected                                Reference 

AGAMOUS1          Abnormal flower                                             [177] 
APETALA1          Abnormal flower                                             [178] 
APETALA2          Abnormal flower                                             [179] 
AXI159            Auxin independent                                           [180] 
CER1              Wax biosynthesis (aldehyde decarboxylase)                   [181] 
CER2              Wax biosynthesis                                            [182] 
CHL1              Chlorate resistant (nitrate transporter)                    [183] 
CHLORATA42        Chlorophyll deficiency (protoporphyrin Mg2+ chelatase)      [184] 
COP1 (FUSCA1)     Constitutive photomorphogenesis                             [185] 
COP9 (FUSCA7)     Constitutive photomorphogenesis                             [186] 
CTR1              Constitutive ethylene response                              [187] 
DWARF1            Gibberellin deficient                                       [188] 
EMB30             Embryo pattern (GNOM)                                       [189] 
FAD2              C18:1 desaturase                                            [190] 
FAD3              C18:2 desaturase                                            [191] 
FAE1              Deficient in fatty acid elongation                          [192] 
FUSCA6            Defective seed germination                                  [193] 
GA4               GA 3β-hydroxylase                                           [194] 
GLABRA2           Trichome absence                                            [195] 
GLABROUS1         Trichome absence, Myb-like                                  [160] 
HY4               Abnormal flowering (blue light receptor)                    [196] 
LUMINIDEPENDENS   Late flowering time                                         [197] 
PALE CRESS        Abnormal plastid development                                [198] 
PFL               Abnormal leaf development                                   [199] 
PISTILLATA        Abnormal flower                                             [175] 
TOUSLED           Abnormal flower                                             [200] 

Note: at least 20 additional genes have been isolated using this strategy. 


isolated; and revertants should be obtained. Because 
transposons usually move only short distances on 
the same chromosome, they should be much more 
useful to generate mutations in a specific region of 
the genome provided that a collection of transgenic 
plants carrying a mapped defective element is 
available. 


33.5.3 Promoter trapping 


The use of T-DNA or transposon tags can only detect 
genes in which mutations cause a change in 
phenotype. Since many genes are present in 
more than one copy, it follows that many will not be 
detected using this strategy. In addition, some genes 
are not essential at all stages of development, and 
many mutations may be silent unless an appropriate 
screening procedure is designed. In order to over- 
come this problem, T-DNA vectors were constructed 
in which a promoter-less reporter gene is located 
close to one of the T-DNA borders. When such a 
construction is inserted near an active promoter it is 
then possible to detect its activity by looking for 
expression of the reporter gene. This strategy is 
being used by several groups to detect promoters 


that function specifically in one organ or one tissue 
[117,118,196]. A refinement of this strategy is the 
enhancer trap, using a transposable element rather 
than a T-DNA [120]. 


33.6 Sequencing cDNA: 
expressed sequence tags 


With the double aim of obtaining a better under- 
standing of gene expression in Arabidopsis and 
contributing to the genome mapping of this species, 
two groups of laboratories, one in France [57] and 
the other in the United States [59], have embarked 
on a project to partially sequence as many cDNAs as 
possible from various tissues. This will enable a gene 
to be identified by an expressed sequence tag (EST). 
The strategies and starting material differ to some 
extent between the two groups. The American group 
has been using the λ-YES vector which allows 
almost direct transfer of the cloned cDNA to yeast, 
whereas the French groups have essentially used 
λ-ZAP. 

The American group at Michigan State University 
has made a single orientated library by pooling 
mRNA prepared from different parts of the plant. 
The rationale for this strategy is to prepare a 
normalized library in which each clone should be 
equally represented, regardless of the enormous 
differences in gene expression. Up to now, sequenc- 
ing has been from the 5’ end of each clone, using an 
automated sequencing apparatus. The French 
groups, which are dispersed in several places, have 
adopted a different approach and have made several 
orientated libraries corresponding to different 
tissues or culture conditions: young etiolated seed- 
lings, young green plantlets, flower buds, immature 
siliques, dry seeds, wounded adult leaves, cell 
suspension cultures and cell suspensions elicited by 
a bacterial pathogen. After an initial period during 
which only one end of the cDNA was sequenced, 
they have sequenced each new cDNA from both 
ends using automated machines. This strategy has 
several advantages: complete sequences for many 
more cDNAs can be obtained; and sequencing of the 
3’ untranslated region, which is usually gene- 
specific, allows the distinction between multigene 
family members for which the coding region is 
almost or completely identical. Studies on various 
libraries allow the expression patterns of a large 
number of genes to be determined from the 
frequency with which the genes are found in the 
different libraries. 

The redundancy of clones selected at random is 
highly variable, depending on the libraries under 
analysis [57]: for instance, 35% of the sequenced 
clones in the immature silique library correspond to 
three multigene families, including the two major 
families of storage proteins (napins and cruciferins). 
There are five members in the napin family and 
three in the cruciferin. Most of the other clones have 
been observed only once or twice, although there is 
evidence for small multigene families. In contrast, in 
the cell suspension library, only 8% of the clones 
have been found to be redundant. Although this 
redundancy is becoming a hindrance in deriving a 
catalogue of Arabidopsis genes, it is a reliable source 
of information on the relative expression of gene 
family members. When more clones are isolated and 
sequenced it should be possible to determine an 
expression pattern simply from the frequency of a 
sequence in different libraries. Each group has now 
isolated the representatives of the most abundantly 
expressed genes in its own situation. The libraries 
are being cleaned by hybridization with 
characterized probes in order to overcome redundancy 
problems. An alternative way of finding new genes 
is to set up specific screening procedures which will 
replace random collection. 
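
The idea of reading an expression pattern from clone frequencies can be sketched in a few lines. The example below assumes that each sequenced clone has already been assigned a gene identifier, and it takes redundancy to mean the fraction of clones that repeat a gene already seen in the same library; this definition, the function names and the toy data are all illustrative assumptions, not the procedure or the figures of either EST programme.

```python
from collections import Counter


def redundancy(clone_ids):
    """Fraction of sequenced clones that duplicate a gene already seen."""
    counts = Counter(clone_ids)
    return 1.0 - len(counts) / len(clone_ids)


def expression_profile(libraries):
    """Relative frequency of each gene in each library.

    libraries maps a library name to the list of gene identifiers
    assigned to its sequenced clones.
    """
    profile = {}
    for name, clone_ids in libraries.items():
        counts = Counter(clone_ids)
        total = len(clone_ids)
        profile[name] = {gene: n / total for gene, n in counts.items()}
    return profile


# Hypothetical toy data, not taken from the libraries described in the text.
libraries = {
    "silique": ["napin", "napin", "cruciferin", "napin", "actin", "geneX"],
    "cell_suspension": ["actin", "geneX", "geneY", "geneZ", "tubulin", "geneW"],
}

if __name__ == "__main__":
    for name, clone_ids in libraries.items():
        print(name, round(redundancy(clone_ids), 2))
    print(expression_profile(libraries)["silique"])
```

With so few clones the frequencies are only indicative; the estimate becomes meaningful as the number of sequenced clones per library grows, which is the point made above.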

The analysis and editing of the sequences makes 
up a large part of the work. It was decided at the 


beginning of the French project to centralize 
information and to eliminate, as far as possible, 
sequences with too many uncertainties, as well as 
identical sequences, before submission to the EMBL 
data bank. Further, a minimum of sequence editing 
and correction is carried out, particularly for 
homologues to known proteins, with the aim of 
submitting sequences which are as accurate as 
possible so that they are useful for the scientific 
community. The submission protocol, defined in 
collaboration with Rainer Fuchs at EMBL, includes 
citation of protein or nucleic acid homologues and 
definition of probable coding regions, providing 
valuable information when database searches are 
carried out. 

Since September 1993 the program has been 
supported by the EC as part of the European 
Scientists Sequencing Arabidopsis (ESSA) project. 
The ESSA project is funding only new sequences 
and an additional and independent verification 
for novelty is carried out at Martinsried by the 
Martinsried Institute for Protein Sequences (MIPS). 
Figure 33.2 shows the flow-chart for dealing with 
sequence data in use in our group and in most of 
the French groups. 

Fig. 33.2 Flow-chart for EST analysis and submission to 
databases. 

This protocol allows us to detect 
redundancies, rapidly resolve ambiguities in 
sequencing, and search for open reading frames 
with homologies in public databases. Thus, it is very 
often possible to extend useful sequence infor- 
mation well beyond the 300 nucleotide limit, after 
which the automatic analysis programme tends to 
insert spurious bases. Early alignment with the local 
database avoids repeatedly sequencing identical 
clones from both ends. Comparison with public 
databases allows one to detect significant similarity 
to known sequences from other organisms and thus 
to assign a putative function to the gene. When a 
disruption in an amino acid alignment but not in the 
nucleotide alignment is observed, this very often 
indicates a sequencing mistake and allows detection 
and correction of such errors. The most frequent 
errors are uncertainties (marked as N), misidentifi- 
cation of a base, and single base deletions or addi- 
tions due to compression beyond 300-350 bases. 
These problems can be largely limited by using 
high-quality DNA (e.g. Qiagen-treated minipreps; 
see Chapter 21) in the reactions. 
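
The frame-shift signal described above can be illustrated with a small check. The sketch below translates a sequence in the three forward frames with the standard genetic code and reports which frames share a run of residues with a known protein; similarity spread over more than one frame suggests a single-base insertion or deletion. It is a simplified stand-in for the alignment programmes actually used, and the function names, the run-length threshold and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate a DNA string in one forward frame; unknown codons give X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame, len(seq) - 2, 3))


def longest_shared_run(a, b):
    """Length of the longest substring common to a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def frames_with_similarity(est, protein, min_run=4):
    """Forward frames whose translation shares a run of at least min_run
    residues with the known protein."""
    return [frame for frame in range(3)
            if longest_shared_run(translate(est, frame), protein) >= min_run]


def suspect_frameshift(est, protein, min_run=4):
    """True when similarity to the protein is split across several frames."""
    return len(frames_with_similarity(est, protein, min_run)) > 1


# Hypothetical toy sequences: a short coding sequence and a copy with one
# base deleted at position 15 to mimic a compression artefact.
reference_cds = "ATGGCTAAAGGTGAACTTCCGTGGTACGATCAT"
protein = translate(reference_cds)
est_with_deletion = reference_cds[:15] + reference_cds[16:]

if __name__ == "__main__":
    print(frames_with_similarity(reference_cds, protein))
    print(frames_with_similarity(est_with_deletion, protein))
```

In practice the threshold would have to be much longer than four residues and the comparison would tolerate mismatches; the point here is only that the match jumps from one frame to another at the position of the error.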

When a sequence has been corrected and edited, 
it is transferred to the consortium GDR database in 
Toulouse where it is compared with sequences pro- 
duced by other groups in the consortium; redundant 
clones or remaining poor-quality sequences are 
identified at this stage. They are added to the local 
database for statistical analysis but are not neces- 
sarily submitted to public databases. The selected 
clones are then returned to each laboratory for 
validation and annotation before being automati- 
cally transferred to the public EMBL database. 
Although the European Union (EU) is paying only for 
new clones, we continue to deposit some sequences 
corresponding to a previously described cDNA if 
they significantly extend the already available data. 
In addition, because most sequences correspond to a 
single run of the machine and are not 100% accurate, 
it is certainly helpful to have some redundancy in 
the database to track sequencing errors. The plant 
origin of the sequences was routinely assessed using 
a quality control algorithm [121]. 

Table 33.5 Evolution of EST entries in dbEST. 

Altogether, we estimate that at the end of 1996, the 
French and American consortia had already pro- 
duced more than 15000 non-redundant ESTs, and 
probably nearly 35000 ESTs if redundancy is not 
taken into account. By the end of February 1996 
there were 24352 entries for Arabidopsis ESTs in the 
centralized EST database at the National Institutes 
of Health, dbEST [122], comparing well with the rice 
and nematode programs (Table 33.5). During the 
first year of ESSA (September 1993-September 
1994), 710 new and original cDNAs (not previously 
tagged by either the French or American programs) 
were partially sequenced from both ends. Out of the 
5358 nonredundant ESTs contributed by the French 
group and released by the end of December 1995 in 
dbEST, 2144 showed a significant similarity at the 
protein level with other sequences in the databanks 
[123]. 

These similarities were generally established 
using the BLASTX program [124] and a score higher 
than 100 unless an obvious signature motif [125] was 
present, and were frequently improved using more 
sensitive programs such as TFASTA [126]. Another 
useful programme is the DOMAINER algorithm, 
which searches for homologous protein domains 
among the entries in SWISSPROT database [127]. 
Table 33.6 gives the statistics for the first 1152 
published ESTs [57]. More than 7000 ESTs have now 
been analysed [20,21] but the general trends are not 
significantly changed. 

It is striking that more than 60% of the genes do 
not yet have any homologues in the databases. With 
the effort made to identify genes in other organisms, 
this proportion is likely to decrease progressively in 
the future, as has been observed for yeast chro- 
mosome III [112]. Although it is difficult to make 
accurate comparisons because of the fact that some 
of the cDNAs have been sequenced only from one 


Table 33.5 (continued) Dates of release 

                          July 1993  Dec. 1993  Mar. 1994  Aug. 1994  Oct. 1994  Feb. 1996  Feb. 1997 
Homo sapiens                  14556      16329      16943      22881      23945     349036     581794 
Caenorhabditis elegans         4699       4699       4699      11590      12104      23438      30196 
Arabidopsis thaliana           1676       3432       4756       8010       8241      24352      29165 
Oryza sativa                   1023       4221       4342       4342       4342      11301      12806 

Zea mays 118 O12 988 988 1183 AASV 
B. campestris 181 181 181 965 965 
B. napus 1021 1425 


Table 33.6 Analysis of the first 1152 published Arabidopsis ESTs. 

Total number of ESTs                                        1152 
Number of different genes 
Number of genes with homologues in databases, of which:     286 (32%) 
  Already identified in Arabidopsis                         35% 
  New member of an Arabidopsis multigene family             7% 
  Homologue to a previously described plant gene            33% 
  Not yet found in plants                                   25% 


end and the sequences are not completely overlap- 
ping, preliminary calculations indicated that 35% of 
the putatively identified genes found by the French 
programme in the seed and silique libraries have 
been found in the American programme and that 
30% are significantly similar to rice genes [128] (see 
Chapter 34). However, higher percentages might be 
found when other tissues, organs and physiological 
conditions are analysed. The percentage drops 
dramatically to only a few percent when non- 
identified ESTs are searched for similarities. These 
preliminary data probably indicate that only the 
most abundantly expressed genes have been identi- 
fied, and that there is still room for independent 
sequencing programmes. 

Other relatively unexpected genes found in 
Arabidopsis include homologues of genes coding for 
the laminin receptor, an annexin, integrins, a Schisto- 
soma haemoglobinase and a selenium-binding pro- 
tein, as well as genes homologous to the Drosophila 
gene abnormal wing disc or to the mouse unp and 
NEDD-6 genes. Approximately 25% of the newly 
identified genes in plants correspond to animal or 
microbial gene sequences, and it would have been 
very difficult or impossible to clone them using 
classical methods. They usually represent genes 
coding for highly conserved proteins such as those 
involved in protein synthesis (ribosomal proteins 
and translation factors), components of the cyto- 
skeleton, such as actins and tubulins, or important 
enzymes from metabolic pathways (energy meta- 
bolism or amino acid biosynthesis). However, a 
number of sequences were completely unexpected, 
such as several tumour suppressor gene homolo- 
gues, including Wilms’ tumour suppressor, a few 
oncogenes such as myb, raf and ras, several 
supposedly brain-specific genes such as a 14-3-3-like 
protein and an acyl CoA-binding protein homolo- 
gous to the benzodiazepine receptor [57]. On the 
other hand, several genes are clearly specific to 
plants, such as those for storage proteins, com- 
ponents of the photosynthetic apparatus, and 
components of the plant cell wall. This series of 
sequences, as well as those from rice [128], will help 
to determine the set of ancient conserved sequences 
which are represented in the plant genomes [129]. 

An interesting lesson from this systematic cDNA 


sequencing project is that it revealed that many 
proteins are encoded by multigene families (Table 
33.7). This was confirmed when homologues were 
identified by PCR. Several of them are now being 
studied in great detail to unravel their pattern of 
expression and understand how they are regulated 
and how they evolved [130-144]. Although Arabidopsis 
is considered to be a rather simple organism, 
some of these gene families have many more mem- 
bers than their animal counterparts. 

Table 33.7 Examples of multigene families in Arabidopsis. 

Besides these two EST programmes, several cDNAs 
have been completely sequenced independently by 
several laboratories. From GenBank release 77 (June 
1993), 220 full-length cDNA sequences were avail- 
able. However, most of these sequences are also 
represented in the ESTs. Presently, ~350 full-length 
cDNAs have been sequenced. 

Another application of the EST programme has 
been to provide numerous probes to determine 
which genes are under the control of a specific reg- 
ulatory protein—for instance, those whose activity 
is modified by the ABI3 gene product, which 
controls sensitivity to abscisic acid in the seed [145]. 
Due to increasing redundancy, both the French and 
American EST programmes have almost completely 
stopped. 


33.7 Genomic sequencing 


Most genomic sequences that have been determined 
correspond to a few genes for which a cDNA could 
be characterized, and the effort towards extensive 
genomic sequencing has begun only recently. In 
GenBank release 77 there were fewer than 120 
Arabidopsis protein-coding genes that had been 
sequenced as genomic clones. A more recent esti- 
mate, based on the EMBL database (March 1997), 
indicates that this figure is now around 600. 
However, this is almost certainly an underestimate 
because many genes have been characterized but 
not necessarily deposited in the database. This is the 
situation for the ESSA programme, which will soon 
release the sequence of 1.8Mb of a region of 
chromosome 4 [146]. Nevertheless, we made a 
calculation on 200 available genes accounting for 
~531 000 bp of genomic DNA. The coding sequences 
represent 230000 bp and the introns 86500 bp. The 
remainder consists essentially of 5’- and 3’-noncod- 
ing flanking sequences. Figure 33.3 shows the 
distribution of the intron number per gene and 
Fig. 33.4 shows the size distribution of the 557 
introns that have been scored in protein-coding 
genes. So far, 30% of the genes are found to be 
intronless and the vast majority have no more than 
four introns; an exception is the RNA polymerase II 
large subunit, which has 24 introns. In contrast to 
animal cells, introns are relatively short, 68% being 
smaller than 100 bp, while exons coding for no more 
than 12 amino acids have been identified [147]. Most 
of them have the canonical border sequences. However, 
these data concern only around one-hundredth 
of the genes and the situation should become much 
clearer when systematic genomic sequencing has 
accumulated several megabases of sequence. 

Fig. 33.3 Distribution of the intron number per gene in a 
sample of 205 sequenced Arabidopsis genes. 

Fig. 33.4 Distribution of the intron size in a sample of 557 
introns present in 205 sequenced genes. Size is in bp. 
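
The composition of the sample quoted above follows from simple arithmetic, sketched here. The three input values are those given in the text; the function name is hypothetical, and the remainder is simply labelled flanking, following the statement that it consists essentially of 5’- and 3’-noncoding sequences.

```python
def composition(total_bp, coding_bp, intron_bp):
    """Split a genomic sample into coding, intron and remaining (flanking)
    classes. Returns (base pairs, percentage of total) for each class."""
    flanking_bp = total_bp - coding_bp - intron_bp
    parts = {"coding": coding_bp, "intron": intron_bp, "flanking": flanking_bp}
    return {name: (bp, 100.0 * bp / total_bp) for name, bp in parts.items()}


# Values quoted in the text for the sample of available genes.
sample = composition(total_bp=531000, coding_bp=230000, intron_bp=86500)

for name, (bp, pct) in sample.items():
    print(f"{name:9s} {bp:7d} bp  {pct:5.1f}%")
```

On these figures coding sequence makes up a little over 43% of the sample, introns about 16% and flanking sequence about 40%.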

As observed from cDNA analysis, many genes 
belong to multigene families. In a few cases (napins, 
glycine-rich proteins, EF1-α proteins, kin1 protein) 
the genes are clustered at a single locus and we have 
information on the distance separating two adjacent 
genes. It is usually of the order of 1kbp or less, 
indicating that genes are relatively densely organ- 
ized. There is no accurate estimate of gene number. 
However, from genomic sequencing data from 
several laboratories, it seems that the gene density 
might be as high as four or five genes within 20 kbp. 
Assuming this is a general situation and that the 
genome size is 100Mbp, including 15% repeated 
sequences, this gives a maximum gene number of 
the order of 17000-21500. On the other hand, if we 
estimate that the above value of 200 genes for 
530000 bp is more representative of the average gene 
density, then there would be room for as many as 
30 000 genes. Obviously more data are needed, and a 
more accurate estimate will only be available when 
larger regions have been sequenced. 
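
Both extrapolations can be written out explicitly. The sketch below scales an observed gene density to the non-repeated part of a 100 Mbp genome. The text does not say whether the repeated fraction was excluded in the second estimate, so applying the same 15% to both is an assumption, and the function name is hypothetical.

```python
def gene_number(genome_bp, genes, per_bp, repeat_fraction=0.0):
    """Extrapolate a gene count from an observed density of `genes` per
    `per_bp` base pairs to the non-repeated part of the genome."""
    unique_bp = genome_bp * (1.0 - repeat_fraction)
    return unique_bp * genes / per_bp


GENOME_BP = 100e6  # genome size assumed in the text

# Four or five genes within 20 kbp, 15% repeated sequences.
low = gene_number(GENOME_BP, 4, 20000, repeat_fraction=0.15)
high = gene_number(GENOME_BP, 5, 20000, repeat_fraction=0.15)

# 200 genes in about 530000 bp; the 15% correction is an assumption here.
sample_based = gene_number(GENOME_BP, 200, 530000, repeat_fraction=0.15)

print(round(low), round(high), round(sample_based))
```

With these inputs the first density gives between 17000 and about 21000 genes and the second about 32000, of the same order as the values quoted above; the spread between the two shows how sensitive the estimate is to the density chosen.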

Toward this goal, the EU launched in September 
1993 the three-year ESSA project as a model study to 
assess the feasibility of sequencing large portions 
of the Arabidopsis genome. This program concerns 
14 groups who volunteered for either small-scale 
(25kbp per year) or medium-scale (>50kbp per 
year) projects. Clones from two large contigs on 
chromosome 4 in the FCA and APETALA2 regions 
(respectively 1500kbp and 500kbp) have been 
distributed to the participants. In addition, some of 
the participants focused on seven additional loci 
dispersed on several chromosomes. Each one has 
agreed to sequence 75kbp of contiguous DNA 
around his favourite gene. This should provide an 
additional 525 kbp. As mentioned above, this pro- 
gram also includes some cDNA sequencing. All the 
sequence data are collected first and analysed by the 
MIPS in Martinsried, Germany, before being re- 
leased to public databases. This group was already 
in charge of analysing yeast chromosomes for 
the EU programme (see Chapter 30) and this 
association guarantees careful assessment and com- 
parison of the data derived from these two model 
organisms. 

The goals for the first year were 370 kbp of 
genomic DNA and 384 kb of cDNA. After two years, 
more than 1 Mb of genomic DNA has been deposited 
at MIPS and considerably more was almost com- 
pleted; nearly 1Mb of cDNA sequence represent- 
ing 1500 new cDNA clones has also been registered 
by MIPS. The programme was on time, and by 
the end of 1996 more than 2.5 Mbp of Arabidopsis 
DNA had been sequenced by EU scientists. 
The major results are the identification of about 
500 genes, more than 80% being completely new. 
Gene density was on average one gene every 
4-5 kbp. Depending on the region analysed, 
between 37% and 60% of the predicted genes 
matched an EST, thus confirming that the EST 
programmes have tagged approximately half of the 
expressed Arabidopsis genes. From a functional point 
of view, it seems that there are relatively few 
examples of clustering of genes involved in a given 
metabolic or transduction pathway. A few genes 
with similar sequences are tandemly organized. 
Large-scale sequencing also revealed additional 
transposable elements. 

By the end of 1996, a new EU Programme (ESSA 2) 
was initiated with the aim of sequencing between 5 
and 7 Mbp on the long arm of chromosome 4. Meanwhile 
three American consortia and one Japanese consor- 
tium have been organized with similar objectives 
so that by the end of the century approximately 
40% of the genome should be determined and 
the complete sequence of the nonrepeated part 
of the genome should be available around 2003- 
2004. 

Because many genes are duplicated, by analysing 
the flanking regions of multigene family members it 
should be possible to learn how the duplication 
arose. Such a project will provide probes which are 
physically linked. Since homologues of some of the 
chosen loci exist in other plants and organisms, it 
will be interesting to determine how much of this 
local synteny has been conserved and how the 
intergenic regions have evolved in different species. 
Such a synteny is already obvious for the cereals, 
in which large portions of the different genomes 
are colinear [146,148-150]. How much of the 
information gained from analysis of the Arabidopsis 
genome can be transferred to other species remains 
to be determined and will be a goal for future 
programmes. 


33.8 Arabidopsis stock centres and 
Arabidopsis-orientated software 


Arabidopsis molecular geneticists are creating a huge 
amount of biological material in the form of seeds, 
mutants, individual cDNA and genomic clones, and 
a wide variety of libraries in different vectors. All 
these resources need to be preserved, stored and 
distributed in order to benefit the whole scientific 
community. At the same time, scientists are faced 
every day with more information concerning 
genetic resources, genetic and physical maps as well 
as sequence data. It is essential that all the 
information be available as easily as possible, not 
only to scientists interested in this plant, but also to 
other scientists and private companies. In addition 
to sending all the DNA sequences to be published to 
GenBank or EMBL, a number of additional sources 
of information have been set up. It should be 
emphasized, however, that new sites appear very 
frequently and that servers may move or be 
discontinued. 


33.8.1 The Arabidopsis stock centres 


There are two major stock centres functioning as an 
international network. They are in charge of distributing
seeds, clones and libraries. They are located,
respectively, at Nottingham, UK (Nottingham
Arabidopsis Stock Centre, NASC) and at
Columbus, Ohio, USA (Arabidopsis Biological 
Resource Center, ABRC). In recent years, NASC and 
ABRC have collected and organized individual 
collections which were at the origin of genetic work 
on Arabidopsis, including those of Rédei, Koornneef, 
and Kranz. Altogether, these three collections rep- 
resent more than 1000 accessions. In addition, the 
stock centres provide numerous mutants which 
have been deposited by individual researchers. 
They provide the recombinant inbred lines mapping 
populations, and a large collection of T-DNA-tagged 
transgenic lines. DNA stocks are preserved and 
distributed by ABRC. They include mapped RFLP 
phages [50] and cosmids [51], several YAC, cDNA 
and genomic libraries as well as individual clones. In 
particular, the cDNA clones partially sequenced by 
the American and French EST programmes are 
stored and distributed by ABRC. One of the goals of 
the stock centres is to encourage researchers to 
deposit seeds and clones as they are published. A US
private company, Lehle Seeds, also provides 
mutagenized seeds as well as custom multiplication 
of F1 and F2 progenies. 


33.8.2 AtDB: an Arabidopsis thaliana database 


AtDB (formerly known as AAtDB) is a database [48] 
that can be accessed through a graphical interface 
using software developed for the C. elegans genome 
project (see Chapter 29). It was originally created by 
the US Department of Agriculture Plant Genome 
project through the National Agricultural Library, 
set up by Mike Cherry and Sam Cartinhour, who 
were in the Department of Molecular Biology at 
MGH, Boston, and curated by John Morris. The 
project is now under the auspices of the National 
Science Foundation (NSF) at the Department of 
Genetics, School of Medicine, Stanford University, 
where the second generation software is under 
development. Versions are available for all widely 
used computers (Sun, Digital, SGI, etc., as well as for
the Macintosh) but it is also available through the 
graphical interface by remote login and by network 
communication tool clients such as Wais, Gopher 
and WWW hypertext clients such as Mosaic or 
Netscape. Information contained in AtDB was 
obtained directly from authors or from various 
public databases. It features a variety of information 
presented in graphical, text and tabular formats. A
large number of interconnections allows the passage 
from one type of information to another, simply by 
clicking with the workstation mouse. As an example 
(Fig. 33.5), one can visualize a chromosome map on 
which RFLP markers are indicated, then zoom in on
a specific region and obtain sequence and reference
information on a gene or recombination frequencies 
for markers. Among the items stored in AtDB 
are: 

* cosmid physical map from Goodman’s lab;

* genetic maps, including the integrated RFLP
map, the RAPD map and the visible markers map;

* recombination data for F2 and recombinant
inbred line mapping populations;

* catalogues of lines and mutants available from
the two stock centres in Nottingham (UK) and
Columbus (USA);

* a list of Arabidopsis scientists with addresses;

* all Arabidopsis DNA sequences registered in
GenBank and EMBL DNA sequence databases;

* bibliographic citations related to Arabidopsis;

* various software for DNA analysis.

AtDB is available without charge via Internet ftp 
transfer or on CD-ROM. A companion guide [48] is 
also provided. Data to be entered in the data base 
should be submitted to the curator of AtDB so that 
the integrity of the database is maintained. 


33.8.3 AIMS: the Arabidopsis Information 
Management System 


AIMS is being developed by S. Pramanik of 
Michigan State University in collaboration with R. 
Scholl of the ABRC. It is funded by the NSF. AIMS
provides support for data management. Data items
include stock centres, cloned genes available from
ABRC, information on RFLP, RAPD and other
markers, including annotations on enzymes to be
used to detect polymorphisms, information on YAC
libraries available in ABRC, cross-homology
between YAC and RFLP markers, genetic mapping
data, sequence and homology search results for all
EST cDNA clones, and colour pictures of plant
phenotypes for many of the stocks. AIMS features also
include graphical display of genetic maps and the
ability to run linkage analysis programs. Contig
information from the physical map has been added 
to the database. Seeds and clones can be ordered 
through on-line AIMS or EMAIL-AIMS from the 
stock centres (see Section 33.8.5). As is the case for 
AtDB, access is possible through the various 
network communication tools. 


33.8.4 The Arabidopsis newsgroup 


This is the privileged communication link within
the Arabidopsis community. It allows both simple
question-and-answer interactions in all fields of
Arabidopsis research and immediate diffusion of
important information on materials, meetings, etc. It
is distributed world-wide through USENET news
under the name of bionet.genome.Arabidopsis and
through e-mail. It is one of the most active newsgroups,
having a readership of ~1300-1400 people
and an average of three to four messages daily.
ARABIDOPSIS postings are indexed in the general
‘biosci.src’ WAIS index and are accessible through
all network clients. Previous postings can easily
be queried by keyword searches by these methods.
Many subjects of interest to Arabidopsis researchers
are also discussed on the plant biology newsgroup.


Fig. 33.5 An example of the use of AtDB. The bottom
window shows the integrated genetic map as
represented in Fig. 33.1. One can click on a specific DNA
marker (e.g. g19821) to access the physical map and
corresponding YACs and cosmids in this contig. A single
click on any object highlights ‘connected’ items. Double-
clicking on YAC and cosmid names brings up
information on these clones, while a double click on a
genetic symbol opens the genetic map. The ‘buttons’ at
the top of the map allow opening of menus linked to
other files associated with selected objects; for instance, it
is possible to ‘zoom in’ on a particular region of the map,
as shown. Double-clicking will bring up additional
information on a marker, with links to literature
references, sequences (if known), etc. The small
rectangles on the second line from the left are linked to
the physical map. (The original images used to prepare
this figure were kindly provided by John Morris, AtDB
curator. They have been slightly modified to emphasize
details and facilitate presentation.)


33.8.5 Obtaining communication tools 


Knowing how and where to obtain the tools neces- 
sary for using the Internet is possibly the hardest 
step in connecting to the different services available. 
An excellent starting point for those new to the 
Internet is A Biologist’s Guide to Internet Resources by 
Una Smith. It can be obtained by sending an e-mail
message to ‘mail-server@rtfm.mit.edu’ containing
the line ‘send pub/usenet/sci.answers/biology/guide/*’.


Anonymous ftp A short guide to anonymous ftp can 
be obtained by sending the message ‘help’ to 
‘info@sunsite.unc.edu’. ftp archives can be search- 
ed using an Archie client, available from one of the 
many archie servers and other software sites. 


Internet Gopher Gopher clients for most types of 
computers are available from the University of 
Minnesota, where Gopher was developed, by 
connecting to boombox.micro.umn.edu. World- 
wide Gopher sites can be searched using the gopher 
tool Veronica. 


World Wide Web This is another tool for retrieving 
information on the Internet and was developed at 
the European Particle Physics Laboratory (CERN). 
The ease of use of World Wide Web (WWW) clients 
simply by clicking with the mouse on highlighted 
words in the text has led to the creation of a very 
large number of server sites throughout the world. 
Mosaic software, developed at the University of 
Illinois, is available by anonymous ftp from ftp.ncsa. 
uiuc.edu or by Gopher from gopher.ncsa.uiuc.edu. 
The widely used Netscape software can be obtained 
from ftp.mcom.com. 


33.8.6 Useful addresses 


The following addresses are given to facilitate 
communication with the Arabidopsis community 
and to identify key persons or services. This is far 
from being an exhaustive list, but most sites have 
links to many others. 


Stock centres 
Mary Anderson: NASC (Nottingham Arabidopsis
Stock Centre), Department of Life Science, University
of Nottingham, University Park, Nottingham
NG7 2RD, UK

Tel.: (+ 44 115) 979 1216

Fax: (+ 44 115) 9513251

E-mail: Arabidopsis@nottingham.ac.uk

WWW: http://nasc.nott.ac.uk

Mary Anderson is the strain curator for AtDB. 


Randy Scholl and Keith Davis: ABRC (Ohio State 
Arabidopsis Biological Resource Center), Ohio State 
University, 1735 Neil Avenue, Columbus, OH 43210, 
USA 

E-mail: Arabidopsis+@osu.edu


Seed stocks 

Tel.: (+ 1 614) 292-9371 for seed ordering information 
Tel.: (+ 1 614) 292-1982, Randy Scholl (for general 
and seed-related questions) 

Fax: (+ 1 614) 292-0603 

E-mail: seeds@genesys.cps.msu.edu 


DNA stocks 

Tel.: (+ 1 614) 292-2115, Keith Davis (for DNA- 
related questions) 

Fax: (+ 1 614) 292-0603 

E-mail: dna@genesys.cps.msu.edu 


AIMS 

Contact Sakti Pramanik, Department of Computer 
Science, A729, Wells Hall, East Lansing, MI 48824, 
USA 

For information on remote access, send an e-mail 
message with subject line ‘help’ to inquire- 
aims@aims.msu.edu 

WWW to AIMS and ABRC: http://genesys.cps.msu.edu:3333/

Help from: aims-manager@genesys.cps.msu.edu


AtDB 

Located in the Department of Genetics, School of 
Medicine, Stanford University, USA 

Contact atdb-curator@genome.stanford.edu 
Anonymous ftp to ftp-genome.stanford.edu 

WWW: http://genome-www.stanford.edu/Arabidopsis/


Arabidopsis electronic newsgroup 

Information on all BIOSCI news groups and means 
of receiving messages can be obtained by anony- 
mous ftp to net.bio.net in the folder pub/BIOSCI/ 
doc or send the message help to biosci@daresbury. 
ac.uk (Europe, Africa and Central Asia) or 
biosci@net.bio.net (Americas and the Pacific rim). 
Do not send subscription messages to the list 
address.


Mutants curator

David Meinke, Oklahoma State University, Still- 
water, OK 74078, USA 

Fax: (+ 1 405) 744-7673 

E-mail: btnydwm@mvs.ucc.okstate.edu 

WWW: http://mutant.lse.okstate.edu/meinke.html


Physical and genetic mapping data 

From the Ecker lab, University of Pennsylvania, 
USA 

E-mail: atgc@atgenome.bio.upenn.edu 

WWW: http://cbil-humgen.upenn.edu/~atgc/ATGCUP.html


Arabidopsis cDNA sequencing analysis project 
Contact Tom Newman, MSU, Michigan State 
University, East Lansing, MI 48824-1312, USA 

A good description of the techniques to access 
information on ESTs is given in ref. 59. 

Tel.: (+ 1 517) 353-0854

Fax: (+ 1 517) 353-9168

E-mail: ten@msu.edu

WWW: http://lenti.med.umn.edu/Arabidopsis


French EST programme 
Contact cooke@univ-perp.fr 


Flower development 

The Yanofsky Lab, San Diego, CA, USA 

WWW: http://www-biology.ucsd.edu/others/yanofsky/home.html


Seeds 

Lehle seeds, PO Box 2366, Round Rock, TX 78680- 
2366, USA 

Tel.: (800) 881-3945 (USA only) or (+1 512) 388-3945 
(outside USA) 

Fax: (+ 1 512) 388-3974

WWW: http://www.Arabidopsis.com/


The Multinational Science Steering Committee 

This committee is in charge of co-ordinating the 
Multinational Co-ordinated Arabidopsis thaliana 
Genome Research Project and is presently composed 
of: Chair: David Meinke, Oklahoma State University,
Stillwater, Oklahoma, USA; Michel Caboche,
Lab. Biol. Cellulaire, INRA, Versailles, France;
Richard B. Flavell, John Innes Centre, Norwich, UK;
Howard Goodman, Massachusetts General Hospital,
Boston, Massachusetts, USA; Gerd Jürgens,
University of Tübingen, Tübingen, Germany; Jose
Martinez Zapater, DPTO de Proteccion Vegetal,
Madrid, Spain; Bernard J. Mulligan, University of
Nottingham, Nottingham, UK; Marc Van Montagu,
University of Ghent, Ghent, Belgium; Robert Last,
Boyce Thompson Institute, Ithaca, New York, USA;
Kiyotaka Okada, Kyoto University, Kyoto, Japan.
Addresses available from AtDB.


Progress reports can be obtained from: 

Fax: (+ 1 703) 644-4278 

E-mail: pubs@nsf.gov 

WWW: http://www.nsf.gov/bio/pubs/arabid/


EU ESSA Project 

Mike Bevan, responsible for the EU ESSA project, 
John Innes Plant Science Centre, Cambridge Labora- 
tory, Norwich, NR7 4UJ, UK 

Tel.: (+ 44 1603) 452571 

Fax: (+44 1603) 456844 

E-mail: michael.bevan@bbsrc.ac.uk


General biological servers (excellent starting 
points) 

Pedro's molecular biology research tools 

WWW: http://www.public.iastate.edu/~pedro/research_tools.html

http://www.biophys.uni-duesseldorf.de/bionet/research_tools.html

http://www.fmi.ch/biology/research_tools.html
http://www.peri.co.jp/Pedro/research_tools.html


Keith Robison’s list of tools 
WWW: http://golgi.harvard.edu/sequences.html 


Atelier Bioinformatique de Marseille 
WWW: http://www-biol.univ-mrs.fr/biologie/logligne.html


Biological information servers, Stanford 

WWW: gopher://genome-gopher.stanford.edu/11/bio

gopher: genome-gopher.stanford.edu/11/bio


Data banks: only WWW starting points are given.
There is a wealth of information at all sites.


NCBI (GenBank, etc.) 
http://www.ncbi.nlm.nih.gov 


European Bioinformatics Institute (EMBL, etc.) 
http://www.ebi.ac.uk


dbEST
http://www.ncbi.nlm.nih.gov/dbEST/index.html


Richard Cooke and Michel Delseny, the authors of this 
paper 
cooke@univ-perp.fr and delseny@univ-perp.fr.


33.9 Arabidopsis as a model
for other species


33.9.1 Arabidopsis as a model for plant species 


The major advantage of deciphering the genome of a 
model plant species is certainly to progress in the 
understanding of plant biology and development, 
particularly of cultivated crops. Many genes regulat- 
ing development are being genetically identified by 
mutations, but the corresponding gene has often still 
to be isolated and the biochemical function of its 
encoded protein elucidated. Such genes are likely to 
be conserved between all plant species. Owing to its 
small size, small genome and ease of genetic 
analysis, the model plant Arabidopsis allows most of 
these processes to be elucidated more easily and 
more rapidly than in any other plant species. 

Arabidopsis will soon be the plant species for 
which we have the most detailed understanding of 
both its genome and physiology. Very close to it, at 
least for the cDNA and genome organization, is rice,
for which more than 11000 ESTs, a detailed RFLP
map (at 1 cM resolution), and YAC, BAC and cosmid
libraries are available (see Chapter 34). However, the 
rice genome is about four times bigger than that 
of Arabidopsis and therefore genomic sequence 
information should be available in the latter species 
much more rapidly. In addition, there are more 
known mutants in Arabidopsis and this species is still 
much easier to transform than rice. 

Indeed, plant biologists need two models because 
higher plants are divided into two large groups, the 
monocotyledons and the dicotyledons, which, in 
addition to obvious morphological differences, 
differ in their codon usage. Monocotyledon codons 
are strongly biased toward C or G in the third 
position. Accordingly, few probes cross-hybridize 
between the two groups. Arabidopsis can be used as 
a model for dicots whereas rice is the model 
for monocots. Most Arabidopsis genes will cross- 
hybridize in relatively stringent conditions with 
genes from other crucifers or closely related families. 
Stringency will have to be reduced for more 
distantly related families. 
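The codon bias described above can be made concrete by measuring the proportion of codons that end in G or C (often called GC3). The following is a minimal sketch only: the function name and the two short reading frames are invented for illustration and are not taken from any real gene. Both frames encode the same peptide (M-A-L-S-K) but differ in their third-position choices.

```python
def gc3_fraction(cds):
    """Return the fraction of codons whose third base is G or C."""
    cds = cds.upper().replace("U", "T")
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be a non-empty multiple of 3 bases")
    third_bases = [cds[i + 2] for i in range(0, len(cds), 3)]
    return sum(base in "GC" for base in third_bases) / len(third_bases)


# Hypothetical reading frames, both encoding Met-Ala-Leu-Ser-Lys
at_rich_frame = "ATGGCTTTATCAAAA"   # third positions mostly A/T
gc_rich_frame = "ATGGCCCTCTCGAAG"   # third positions all G/C

print(gc3_fraction(at_rich_frame), gc3_fraction(gc_rich_frame))
```

Two sequences that differ this much at silent positions share few long identical stretches, which is consistent with the poor cross-hybridization noted in the text.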

Although it is not always possible to isolate genes 
from crops with Arabidopsis probes by cross- 
hybridization, the availability of large collections of 
ESTs from this species as well as from rice should 
allow identification of gene regions coding for 
conserved motifs. From these, it should be possible 
to derive PCR primers taking codon usage bias into 
account. Therefore, with the information from both 
model plants it should be possible to isolate most 
common genes without too great an effort [123]. In
addition, the plant information, combined with that
available from yeast, C. elegans, mouse and humans 
should help in designing PCR primers enabling 
homologous genes to be isolated from any living 
organism. 
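As a rough illustration of how codon usage bias can be taken into account when deriving primers from a conserved motif, the sketch below counts how many different oligonucleotides would be needed to cover every possible coding sequence for a short peptide, with and without restricting the third codon position to G or C. The motif `WMDEF` and the function names are hypothetical; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))


def codons_for(amino_acid, gc3_only=False):
    """List codons for one amino acid, optionally keeping only those ending in G/C."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if gc3_only:
        biased = [c for c in codons if c[2] in "GC"]
        codons = biased or codons  # fall back if no codon ends in G/C
    return codons


def degeneracy(peptide, gc3_only=False):
    """Number of distinct oligonucleotides needed to encode the peptide."""
    total = 1
    for amino_acid in peptide:
        total *= len(codons_for(amino_acid, gc3_only))
    return total


motif = "WMDEF"  # hypothetical conserved motif
print(degeneracy(motif), degeneracy(motif, gc3_only=True))
```

Restricting the third position lowers the degeneracy of the primer pool, at the cost of missing genes that depart from the assumed bias.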

Many genes are not only conserved in function 
and sequence among plant species, but are also often 
in the same order on the chromosomes in different 
species. This phenomenon of synteny, which is most 
spectacular in cereals [150] (see Chapter 32), opens 
the way to comparative mapping and cloning. For 
example, it is now clear that rice chromosome 1 
largely corresponds to wheat homeology group 3, 
and there are many examples of such synteny 
between rice, maize, sorghum, wheat and sugarcane 
[148,149]. Although such analysis is less advanced 
with Arabidopsis, which is a wild species, there
is extensive information on the Brassica genomes, 
which belong to the same family [151-153]. The 
strategy for comparative cloning will be to map a 
gene of interest in a given crop, then move to the 
homologous region in the model species, isolate the 
functionally equivalent gene and then return with 
this probe to the original crop genomic library. Such 
a strategy should be extremely useful, for instance, 
in tracking disease-resistance genes since it has 
recently been demonstrated that the RPG1 locus of 
soybean (conferring resistance to the pathovar 
Pseudomonas syringae pv glycinea avrB race) is 
functionally equivalent to the Arabidopsis locus RPS3 
[154]. Thanks to work on Arabidopsis, several im- 
portant genes could be isolated in other species: a 
striking example is the Arabidopsis AGAMOUS gene, 
which controls flower development and for which 
homologues have been isolated in several species 
including maize and tomato [155,156]. 
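One simple way to quantify how far gene order is conserved between two maps is to find the largest set of shared markers that occur in the same relative order on both. The sketch below does this with a longest-increasing-subsequence search. It assumes each marker occurs once per map and that both maps are read in the same orientation; the marker names and the two orders are invented.

```python
from bisect import bisect_left


def colinear_markers(order_a, order_b):
    """Longest list of shared markers found in the same order on both maps."""
    position_b = {marker: i for i, marker in enumerate(order_b)}
    shared = [m for m in order_a if m in position_b]

    tails = []       # tails[k]: index in `shared` ending the best run of length k+1
    tail_pos = []    # map-B positions parallel to `tails`
    previous = [None] * len(shared)
    for i, marker in enumerate(shared):
        pos = position_b[marker]
        k = bisect_left(tail_pos, pos)
        if k == len(tails):
            tails.append(i)
            tail_pos.append(pos)
        else:
            tails[k] = i
            tail_pos[k] = pos
        previous[i] = tails[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest run
    run = []
    i = tails[-1] if tails else None
    while i is not None:
        run.append(shared[i])
        i = previous[i]
    return run[::-1]


# Hypothetical marker orders on two homoeologous chromosome segments
map_a = ["m1", "m2", "m3", "m4", "m5", "m6"]
map_b = ["m1", "m3", "m2", "m4", "x9", "m6"]
print(colinear_markers(map_a, map_b))
```

A segment inverted in one species would score poorly here; running the comparison a second time against the reversed order of one map is a simple way to detect that case.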

A general strategy for identifying the function of 
an unknown gene is to antisense, overexpress or 
disrupt it and try to observe a phenotype in order to 
discover which process is altered. The first two 
strategies are not always successful, and gene 
disruption is not yet possible by homologous 
recombination in plants. However, the availability 
of a large collection of independent insertion mut- 
ants saturating the genome should allow the isola- 
tion of lines in which the desired gene is knocked 
out. Of course this strategy might not be successful 
either if, as in yeast, many gene disruptions do not 
confer a detectable phenotype. One of the major 
difficulties is certainly the presence of several copies 
of functionally similar genes. 
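The size of an insertion collection needed to 'saturate' the genome can be estimated with a simple probability argument: if each line carries one insertion at a random position, the chance that at least one of n lines hits a target of x bp in a genome of G bp is 1 - (1 - x/G)^n. The sketch below applies this formula. The genome size (100 Mb) and gene size (2 kb) are illustrative assumptions, and real insertions are neither perfectly random nor always single-copy.

```python
import math


def hit_probability(n_lines, gene_bp, genome_bp):
    """Probability that at least one of n random insertions falls in the gene."""
    return 1.0 - (1.0 - gene_bp / genome_bp) ** n_lines


def lines_needed(probability, gene_bp, genome_bp):
    """Smallest number of single-insertion lines giving the requested probability."""
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - gene_bp / genome_bp))


# Assumed values for illustration only
GENOME_BP = 100_000_000   # assumed genome size
GENE_BP = 2_000           # assumed size of the target gene

n = lines_needed(0.95, GENE_BP, GENOME_BP)
print(n, hit_probability(n, GENE_BP, GENOME_BP))
```

Because the required number of lines scales inversely with target size, small genes are the hardest to recover from such a collection.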

The second problem, that of finding the gene 
corresponding to a mutant, can be solved without 
too much effort when fairly detailed genetic and 
physical maps are available. Because these two tools
will soon be available for Arabidopsis, isolating a
gene for which a mutant is known should become
increasingly easy. Genes corresponding to specific
mutations can also be isolated relatively easily if the 
mutation is tagged by a known insertion. Again, 
only Arabidopsis provides all these facilities in a 
plant, although some progress is also being made in 
rice. However, the final test to demonstrate that the 
correct gene has been isolated is to complement the 
mutant and this is still very difficult in most species. 


33.9.2 Arabidopsis as a model for 
non-plant species 


Arabidopsis can be used as a model for non-plant 
species, including animal genomes, in at least two 
ways. The first is to provide a source of homologous 
genes for comparison with those from other organ- 
isms. Comparative analysis enables identification of 
the conserved sequences and structures in a given 
protein that are really important for, say, its enzyme 
activity and its specific interactions with other 
proteins and substrates. The second is to use the 
ability to genetically manipulate the genome of this 
model plant in order to introduce genes from other 
genomes. 

The major advantage of having several catalogues 
of ESTs from different species is in enabling the 
eventual determination of how many genes are 
common to all living kingdoms. Although we do not 
have any precise estimate, about a third of the non- 
redundant plant ESTs are homologous to something 
already in the databases. Many genes are indeed 
common to both animal and plant cells, and 
comparison of their sequences should tell us how 
basic conserved protein motifs have evolved and 
have become specialized for different functions. 
Several striking points have already emerged. Some 
well-conserved genes are represented in plants by 
multigene families that have many more members 
than their animal counterparts (Table 33.7). Each of 
these genes seems to have evolved a specific pattern 
of tissue- or organ-specific expression and differen- 
tial response to environmental stress. Comparison of 
data from Arabidopsis, rice and other plant species 
suggests that most of these genes duplicated long 
ago, and now comprise several related subfamilies 
that have differentiated in different species. 

One of the most striking examples of homology 
between plant and animal genes is in the homeobox 
genes that code for proteins with DNA-binding 
homeodomains [140]. These genes were initially 
described in Drosophila, where they are involved in 
pattern formation, segmentation and the specifi- 
cation of the various appendages. Similar genes
have been found in vertebrates, where they also
seem to be involved in segmentation. Recently, 
homologous homeobox-containing genes have been 
described in plants, where the homologous function 
does not exist [79]. Indeed, two such genes encode 
proteins of the phytochrome-controlled signal 
transduction cascade, and these genes are turned on 
by far-red light. However, BELL1, a member of the 
homeobox gene family, has recently been shown to 
be involved in pattern formation in the Arabidopsis 
ovule primordia [157]. The proteins encoded by 
these genes function as transcription factors, and the 
same motifs have most likely been re-used to 
activate a different transduction pathway. Recently
the CURLY LEAF gene (CLF) from Arabidopsis has
been demonstrated to be a homologue of Enhancer
of zeste, a member of the Polycomb gene family in
Drosophila, which controls the activity of homeodomain
genes. CLF is functionally similar in
repressing the activity of AGAMOUS, a homeotic
gene controlling flower formation, in vegetative
tissues [158].

Several other genes coding for components of the 
light-signalling pathway (COP1, COP9, COP11 and 
DET1) show striking similarities to Drosophila, 
human and C. elegans genes or ESTs, and might help 
in identifying developmental regulatory genes 
shared by plants and animals [159]. The same is true 
for the myb oncogenes, which code for a group of 
transcription regulatory proteins and which are 
represented by gene families in higher plants: at 
least five myb-related genes have been described in 
the snapdragon (Antirrhinum), and two from Ara- 
bidopsis have already been described and analysed in 
detail: one is responsible for the differentiation of 
trichomes [160] while the other is specifically in- 
duced by drought [77]. Another example of con- 
servation is the 14-3-3-like protein gene [138], which 
was initially described as a brain-specific gene. In 
fact, it codes for a protein phosphatase inhibitor 
and this function has obviously been reused in 
several plant signal transduction pathways. 

Some animal genes will repay study in the plant 
context, where they can be subject to fine genetic 
analysis and the effects of ectopic expression follow- 
ing transformation can be observed. For instance, 
the mode of action of some tumour suppressor 
genes and their relationship with the cell cycle might 
be much easier to analyse in Arabidopsis than in 
human cell cultures. This is certainly the case for the 
well-conserved Wilms’ tumour suppressor, for
which a plant homologue has been recently describ- 
ed in Arabidopsis and rice [161]. A further example is 
given by the recent demonstration that the Arabidopsis
Cnx1 protein is capable of complementing the
Escherichia coli Moco (molybdenum cofactor) mutant
mogA [162]. The plant protein shows homology to 
mammalian proteins whose function is uncertain, 
although the inborn loss of human Moco leads to 
impaired development of the brain and parts of the 
nervous system [163]. Further studies on Cnx1 could 
well shed light on the function of the mammalian 
homologues. Similarly, it might be of interest to 
learn that a gene which was initially supposed to be 
specific to the human brain but whose function is 
obscure is expressed in some plant tissues. Another 
similar example is the dad1 gene (deficient in 
apoptotic death) which is also present in Arabi- 
dopsis [164]. This kind of link can provide hints 
concerning the function of some genes.


34.1 Introduction


Rice (Oryza sativa L.) is the main food in Asia, and for 
thousands of years varieties have been selected on 
the basis of the best agronomic characters. Rice 
breeding has been very successful in improving 
yields and crop reliability, but further improvements 
using biotechnology are necessary to keep up with 
the population growth of Asia. 

During the past two decades the molecular 
biology of rice has advanced greatly, especially in 
Japan. Much information on its genetics is already 
available, with over 150 markers on the classical 
linkage map [1]. Rice has also recently emerged as a 
model plant for mapping all cereal genomes and 
isolating agronomically or scientifically important 
genes for plant breeding. This is due to its small 
genome size (430 megabases (Mb), the smallest in 
cereals), the availability of detailed linkage maps 
[2,3], and the large amount of synteny found 
between rice and other cereals [4]. Rice as a model 
monocot plant also complements the dicot model 
plant Arabidopsis thaliana (see Chapter 33), and 
together they will reveal the genome organization 
and gene function of most plants. 

The Japanese Rice Genome Research Program 
(RGP) was launched in 1991 by the Ministry of 
Agriculture, Forestry and Fisheries to carry out 
comprehensive genome mapping and large-scale 
expressed gene sequencing. The aim is to promote 
biotechnological applications in rice and cereal 
breeding worldwide. For this purpose information 
on the research results is released as soon as 
possible. 

Since 1992, RGP has published a semiannual 
newsletter, now being sent free of charge to some 
2000 researchers in over 50 countries. Some 10000 
restriction fragment length polymorphism (RFLP) 
markers and 1000 cDNA clones have been distri- 
buted to about 200 laboratories worldwide and over 
10000 cDNA sequences made available in the inter- 
national databanks. In December 1994, an Internet 
information service on rice genome research data 
was launched on the World Wide Web (address: 
http://www.staff.or.jp) as well as an ftp server
(address: ftp.staff.or.jp) for distributing large data 
files to all plant genome researchers. 

This chapter describes the current techniques 
being used and the results from the first three years 
of the programme, with technical details that should 
be helpful for people launching or expanding other 
plant genome projects. 


34.2 Large-scale rice cDNA analysis 


34.2.1 cDNA library construction and sequencing 


In recent years, several large-scale cDNA projects 
have been in progress for various organisms, includ- 
ing humans [5], rice [6,7], Arabidopsis [8] and the 
nematode Caenorhabditis elegans [9,10]. Large-scale 
cDNA analyses have two great advantages for the 
investigation of genomes. First, isolated and partially 
characterized cDNA clones can be used as probes for 
genetic linkage and physical mapping. The cDNA 
clones have been used not only as expressed sequence 
tags (ESTs) on the RFLP linkage map [2] (Section 34.4) 
but also as good probes to screen YAC clones for the 
construction of the physical map of the chromosomes 
(Section 34.5). Second, isolated cDNA clones encode 
the amino acid sequence of expressed proteins. 
Therefore, if the cDNA library contains the cDNAs of
all expressed proteins, the primary structure of any 
protein can be obtained from the library. A good- 
quality cDNA clone library will also be a powerful 
tool for isolation and characterization of useful genes 
for breeding and other applications. 

In Arabidopsis, about 25000 genes are thought to 
be expressed [11,12]. In the case of rice, the genome
is considered to be about three times larger
than that of Arabidopsis [13], and the total number of 
the expressed genes in rice is estimated to be roughly 
30000. The final aim of the large-scale cDNA analy- 
sis in rice is to catalogue the cDNAs of all expressed 
genes. Toward this goal, the RGP has been isolating 
and sequencing rice cDNAs. The isolation and 
characterization of several rice cDNA clones showing
sequence similarities with known genes and
proteins have previously been reported, such as the
ATP/ADP translocator [14], the mitochondrial
ATPase β-subunit [15], cdc2 [16] and the NADP-
dependent malic enzyme [17]. About 2200 callus 
cDNAs have recently been isolated and partially 
sequenced, and around 700 of these were found to 
have significant homologies with known proteins 
[6]. A japonica rice variety, ‘Nipponbare’, has been 
used for cDNA analysis. 

In order to characterize the cDNAs of all expressed
genes, it is necessary to construct various cDNA
libraries prepared from different rice tissues, includ- 
ing callus, under different growing conditions, since 
many rice genes are expected to be expressed only in 
specific tissues and at specific growth stages, or only 
when the plant is exposed to specific environmental 
stresses. So far, we have prepared cDNA libraries 
from root, green shoot and etiolated shoot. We have 
also made libraries from calli grown in four different 
culture conditions: growth-phase callus grown in
medium with 2,4-dichlorophenoxyacetic acid; 6-
benzyladenine-treated callus grown in a medium 
with 6-benzyladenine (BA); redifferentiation callus 
grown in a medium with 6-benzyladenine and 1- 
naphthaleneacetic acid; and heat-treated callus 
grown at 37 °C (Figs 34.1 and 34.2). 

Figure 34.3 shows the strategy for cDNA analysis 
used in the RGP. Total RNAs were isolated from the 
tissues by a single-step guanidinium thiocyanate 


Fig. 34.1 Source materials for root and shoot cDNA 
libraries. Roots and green shoots were harvested from 
seedlings of rice 'Nipponbare' grown in the light at 9 and 
8 days after sowing, respectively. Etiolated shoots were 
harvested from seedlings grown in the dark at 8 days after 
sowing. Total mRNA was isolated from the harvested 
tissues and cDNA libraries were prepared. 


[Fig. 34.2 flow diagram: embryo of rice seed ('Nipponbare') 
-> induction of callus (medium with 2,4-D) -> subcultivation 
of callus (N6 medium with 2,4-D, 25°C) -> four treatments 
-> preparation of cDNA libraries. Treatments: growth-phase 
callus, 2,4-D (1 mg l-1); redifferentiation callus, NAA 
(0.01 mg l-1) and BA (0.1 mg l-1); BA-treated callus, BA 
(1 mg l-1); heat-treated callus, heat shock (37°C, 3.5 h).] 


Fig. 34.2 Source materials for callus cDNA libraries. 
Induced callus was subcultivated in N6 medium 
containing 2,4-D and then grown in four different 
conditions as indicated. Callus was harvested after 12 
days cultivation, and total mRNA isolated. 


method (see Chapter 18). The poly(A)+ RNA was 
purified through Oligotex-dT 30 (Daiichi Kagaku, 
Japan). Then cDNAs were synthesized according to 
the SuperScript Plasmid System (BRL). Adaptors were 
ligated asymmetrically to the cDNAs, which were then 
cloned into pBluescript II SK+ with a SalI site at the 
5' end and a NotI site at the 3' end. NM522 competent 
cells (Stratagene) were transformed with the ligation products. 
Clones were picked randomly for sequencing. The 
insert length of cDNAs in plasmid DNAs was 
checked by agarose gel electrophoresis after double 
digestion with SalI and NotI. The plasmid DNAs 
were prepared by a robotic machine (Kurabo, 
Japan). 

Template DNA was prepared as single-stranded 
(ss) DNA rescued by helper phage M13KO7 (see 
Chapter 21), and sequenced by the dideoxy method 
(see Chapter 22). Recently, several robotic work- 
stations have been introduced to scale up this step. 
In the RGP, several manual and automated ssDNA 
preparation systems have been used in parallel as 
summarized below. 

1 Classical PEG-phenol method. A manual system 
requiring 3h for preparation of 24 samples (see 
Chapter 21). 

2 Fast Magnetic Purification Kit (Amersham, UK). 
Uses magnetic beads: a manual system requiring 
1.5h for preparation of 24 samples. 

3 EasyPrep M13 Prep Kit (Pharmacia Biotech, USA). 
Uses glass-fibre filter: a manual system requiring 
1.5h for preparation of 24 samples. 

Sequencing reaction: 

1 Cycle sequencing method (see Chapter 22) with a 
thermal cycler. Manual system requiring 3h for 
processing of 24 samples. 

2 BcaBEST Dideoxy Sequencing Kit (TaKaRa, Japan). 
Manual system requiring 2h for processing of 24 
samples. 

3 Catalyst Robotic Workstation (Perkin Elmer/Applied 
Biosystems, USA). Automatic system requiring 5h 
for processing of 24 samples. 

We also use a DNA sequencing robot (Amersham) 
as a semiautomatic system for preparing ssDNA for 
sequencing reactions. It combines the Fast Magnetic 
Purification system with ΔTaq cycle sequencing. 
The system carries out these steps 
for 24 samples within 5h. Many companies are now 
developing this type of automated system and their 
use will become mainstream in large-scale DNA 
template preparation technology. Sequencing is 
being carried out by automated DNA sequencers 
using chemical labelling (ABI model 373A, 
Perkin Elmer/Applied Biosystems). In the RGP at 
present, 11 DNA sequencers are being run in 
parallel. Currently, over 100000 bp of cDNAs can be 

Table 34.1 Number of analysed rice cDNAs as of February 1995. 

cDNA library                Analysed clones   Hit clones(a)   %(b)   Submitted clones(c) 
Callus 
  BA(d) treatment                  1915            472        24.6          608 
  NAA(e) + BA treatment             442            118        26.7            0 
  Heat shock treatment(f)          2673            581        21.7            0 
  Growth phase(g)                  2492            740        29.7         2447 
Root                               1965            511        26.0         1849 
Green shoot                        4759           1280        26.9         3431 
Etiolated shoot                    4211            905        21.5         2654 
Others                             2511            567        22.6            0 
Total                             20968           5174        24.7        10989 

(a) Clones with significant similarities to known proteins. The FASTA algorithm was used for the similarity search to the 
PIR database and an optimized score of at least 200 was required for putative assignment. 
(b) (Hit clones/analysed clones) × 100. 
(c) The sequence data of clones were submitted to the DDBJ database. 
(d) BA, 6-benzyladenine. 
(e) NAA, 1-naphthaleneacetic acid. 
(f) This callus was treated at 37°C for 3.5 h, after incubating at 25°C for 12 days. 
(g) Callus grown in medium with 2,4-dichlorophenoxyacetic acid (2,4-D). 


sequenced per day. New types of sequencers that 
can analyse a larger number of samples and read 
longer sequences faster, are being developed. The 
rice cDNA analysis in the RGP was started in the 
autumn of 1991, and as of February 1995, we have 
isolated and partially sequenced over 20000 cDNA 
clones from various cDNA libraries prepared from 
intact plant tissues and calli (Table 34.1). The 
sequence data from the cDNA clones have been 
stored in our in-house database RiceBase (see 
Section 34.6). Over 10000 sequences have already 
been released and are available through DNA Data 
Bank of Japan (DDBJ), GenBank and EMBL. 


34.2.2 cDNA identification, tissue-specificity 
and redundancy 


Nucleotide sequences were transferred via a com- 
puter network to the main computer. The sequence 
data were translated to amino acid sequences for all 
three frames, and a similarity search of each 
sequence with known protein sequences in the PIR 
database was done using the FASTA algorithm [18]. 
The cDNA clones whose sequences showed an 
optimized similarity score over 200 were selected 
and considered to have homologies with the 
proteins in the database. We frequently found clones 
showing homologies with several different proteins. 
In such cases, we tentatively identified the cDNAs as 
encoding the protein showing the highest score 
among the candidate proteins. 
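
The translation and assignment steps described above can be sketched in Python. This is only an illustration, not the RGP pipeline: the function names are hypothetical, the similarity scores are assumed to come from a separate FASTA run, and the threshold is written as "at least 200" (the text gives both "over 200" and "at least 200").

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_three_frames(dna):
    """Translate the given strand in reading frames 0, 1 and 2."""
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        # Codons containing ambiguous bases are reported as 'X'.
        frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def assign_protein(hits, min_score=200):
    """Pick the highest-scoring hit at or above the threshold.

    hits: list of (protein_name, optimized_score) pairs, assumed to be
    parsed from similarity-search output (hypothetical input format).
    Returns None when no hit qualifies.
    """
    qualifying = [h for h in hits if h[1] >= min_score]
    return max(qualifying, key=lambda h: h[1]) if qualifying else None
```

A clone whose candidate list is `[("actin", 350), ("tubulin", 180), ("hsp", 410)]` would be tentatively assigned to the highest-scoring entry, mirroring the rule stated in the text.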

The number of cDNA clones analysed in the RGP 
from October 1991 to February 1995 is summarized 


in Table 34.1. These sequenced clones were exam- 
ined for similarities of predicted amino acid 
sequences to known proteins in the PIR database 
using the FASTA algorithm. Only 25% of the clones 
showed significant similarities with known pro- 
teins. Thus, most of the isolated cDNA clones are 
considered to encode unknown proteins. Table 34.2 
shows some of the putatively identified genes from 
the root cDNA library and the growth-phase callus 
cDNA library. 
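
As a quick check of the "about 25%" figure, the hit percentage defined in Table 34.1, (hit clones/analysed clones) × 100, can be recomputed from the clone counts. The sketch below uses three rows of the table; the dictionary layout and function name are assumptions made for illustration.

```python
# (analysed clones, hit clones) taken from Table 34.1.
COUNTS = {
    "BA treatment": (1915, 472),
    "Green shoot": (4759, 1280),
    "Total": (20968, 5174),
}


def hit_percentage(analysed, hits):
    """Percentage of analysed clones with a significant database hit."""
    return round(100.0 * hits / analysed, 1)


percentages = {name: hit_percentage(*pair) for name, pair in COUNTS.items()}
for name, value in percentages.items():
    print(f"{name}: {value}%")
```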

The results of the similarity search showed that 
the cDNA libraries from various tissues were signifi- 
cantly different in their specificity of gene expres- 
sion. As expected, the clones related to photosyn- 
thetic proteins were mainly obtained from the green 
shoot cDNA library. The cDNA clones from etiolated 
shoot complemented those from green shoot, 
but some proteins with unknown functions, such as 
viscotoxin, were also found. This suggests that some 
genes suppressed under light are stimulated and 
expressed in the dark. In the root cDNA library, clones 
with significant homology to peroxidase, ribosomal 
proteins, and some metal-binding proteins were 
identified. 

The characterization of the clones obtained from 
callus cDNA libraries also showed features of gene 
expression specific to the growth conditions 
(Fig. 34.2). Many ribosomal proteins and histone 
genes were found in growth-phase callus. For BA- 
treated callus, several chitinase genes, including an 
unknown plant chitinase class III, were identified. 
The cDNA library from redifferentiation callus 
included a higher percentage of clones encoding a- 

Table 34.2 List of putatively identified proteins from rice callus and root cDNA. 

Clone name | DDBJ accession no. | Length in aa | Match (%) | Initial score | Optimized score | Putatively identified protein name | Original species name 
RA0578 | D23922 | 92 | 89.8 | | | (S)-tetrahydroberberine oxidase | Coptis japonica 
RA1538 | D24218 | 115 | 100 | | | 14-3-3 protein homologue | Rice 
RA2648 | D24845 | 112 | 91 | | | 2,3-bisphosphoglycerate-independent phosphoglycerate mutase, PGAM | Maize 
RA0411 | D23852 | 116 | 87.9 | | | 26S protease subunit 4 | Human 
RA1356 | D39052 | 111 | 92.8 | | | 3-Oxoacyl-[acyl-carrier-protein] synthase precursor, chloroplast | Barley 
CK0982 | D15628 | 118 | 100 | | | Actin 1 | Rice 
CK1555 | D15925 | 120 | 84.9 | | | Acyl carrier protein 3 | Barley 
RA3246 | D25115 | 109 | 97.2 | | | Adenosylhomocysteinase (EC 3.3.1.1) | Madagascar periwinkle 
RA3368 | D25150 | 90 | 100 | | | ADP, ATP carrier protein | Rice 
CK1681 | D22874 | 120 | 97.8 | | | Alcohol dehydrogenase, ADH1 | Rice 
RA1895 | D24441 | 142 | 89.4 | | | Aspartate aminotransferase | Proso millet 
RA0210 | D23806 | 124 | 89.1 | | | Aspartic proteinase | Rice 
RA1884 | D24433 | 143 | 100 | | | ATP/ADP translocator protein | Rice 
CK0686 | D15470 | 133 | 100 | | | ATPase | Rice 
CK2996 | D23554 | 87 | 94 | | | BBC1 protein | Arabidopsis thaliana 
RA1753 | D24337 | 119 | 98.9 | | | Calmodulin | Rice 
RA2856 | D24965 | 146 | 86.3 | | | Casein kinase I α-chain | Maize 
RA2656 | D24852 | 118 | 94.1 | | | Catalase chain 1 | Maize 
RA2531 | D24771 | 112 | 97.7 | | | cdc2 protein kinase homolog 1 | Rice 
CK1328 | D15815 | 131 | 97.5 | | | Chaperonin 60 | Cucurbit 
RA2584 | D24805 | 121 | 89.2 | | | Cinnamyl-alcohol dehydrogenase | Kidney bean 
CK0258 | D15204 | 109 | 81.7 | | | Cold-induced protein BnC24A | Rape 
RA3384 | D39350 | Bil | 86.2 | | | Cyc07 protein, S-phase specific | Madagascar periwinkle 
CK1525 | D15917 | 116 | 93.5 | | | Cytochrome b5 | Rice 
CK1991 | D15999 | 109 | 97.3 | | | Cytochrome c | Rice 
CK1229 | D15777 | 100 | 94.1 | | | Elongation factor 1 β'-chain | Rice 
RA2717 | D24888 | 70 | 100 | | | Elongation factor eEF-1 α-chain | Wheat 
CK2630 | D16057 | 130 | 80.2 | | | Elongation factor eEF-1 β-A1 chain | Arabidopsis thaliana 
RA2741 | D24902 | 127 | 88.7 | | | Elongation factor eEF-2 | Chlorella kessleri 
RA2075 | D24506 | 122 | 95.1 | | | Enolase | Maize 
RA0111 | D23770 | 140 | 83.6 | | | Formate dehydrogenase precursor, mitochondrial | Potato 
RA0876 | D24022 | 91 | 100 | | | Fructose-bisphosphate aldolase, cytosolic | Rice 
CK1047 | D15663 | 122 | 98.8 | | | GF14-12 protein | Maize 
CK1142 | D15718 | 147 | 94.6 | | | Glucose-1-phosphate adenylyltransferase | Barley 

CK2690 | D16060 | 151 | 89.2 | 535 | 538 | Glutamate synthase (NADH) (EC 1.4.1.14) | Alfalfa 
CK2834 | D16071 | 128 | 100 | 233 | 233 | Glutamate-ammonia ligase (EC 6.3.1.2) 1 precursor, chloroplast | Maize 
RA2455 | D24733 | 118 | 89 | 481 | 492 | Glycine hydroxymethyltransferase (EC 2.1.2.1) | Flaveria pringlei 
CK1818 | D22933 | 110 | O77, | 272 | 392 | Glycine-rich cell wall structural protein 2 precursor | Rice 
CK0553 | D38799 | 112 | 83.3 | 307 | 307 | Glycine-rich protein | Maize 
CK1467 | D28232 | 136 | 100 | 382 | 382 | GOS2 protein | Rice 
CK0661 | D15451 | 111 | 100 | 411 | 411 | GTP-binding protein | Rice 
CK2495 | D16050 | 114 | 100 | 308 | 308 | GTP-binding protein rab | Garden pea 
CK1388 | D15842 | 84 | 98.8 | 410 | 410 | GTP-binding protein rgp2 | Rice 
CK0922 | D22687 | 109 | 80.2 | 477 | 487 | GTP-binding regulatory protein β-chain homologue | Chlamydomonas reinhardtii 
RA2635 | D24836 | 113 | 98.4 | 335 | 336 | Guanine nucleotide regulatory protein | Fava bean 
RA1512 | D24194 | 119 | 100 | 326 | 326 | H+-transporting ATP synthase β chain, mitochondrial | Maize 
RA1571 | D24242 | 134 | 98.5 | 669 | 669 | Heat shock protein 82, HSP82 | Rice 
CK1497 | D15907 | 133 | 94 | 318 | 387 | High mobility group protein | Maize 
CK0893 | D22681 | 87 | 100 | 247 | 247 | Histone H2B.2 | Wheat 
CK1254 | D22765 | 108 | 90 | 346 | 347 | Histone H4 | Maize 
RA2248 | D24610 | 95 | 95 | 272 | PDD | Immunoglobulin-binding protein homolog b70 | Maize 
RA2031 | D24486 | 92 | 100 | 479 | 479 | Initiation factor 4A | Rice 
CK1526 | D22833 | 119 | 97.4 | 565 | 565 | Initiation factor eIF-4A | Curled-leaved tobacco 
RA2404 | D24702 | 104 | 98.7 | 367 | 367 | Initiation factor eIF-5A | Common tobacco 
RA0195 | D23800 | 108 | 92.5 | 479 | 498 | Isocitrate dehydrogenase | Soybean 
RA0078 | D23757 | 169 | 84.2 | 430 | 430 | KatA protein | Arabidopsis thaliana 
CK2640 | D23334 | 143 | 90.1 | 627 | 628 | Ketol-acid reductoisomerase (EC 1.1.1.86) | Arabidopsis thaliana 
RA2209 | D24582 | ie | 85.6 | 477 | 481 | Lipoamide dehydrogenase, LDH, L subunit of glycine decarboxylase | Peas 
RA0886 | D24025 | 96 | 90.6 | 275 | 275 | Malate dehydrogenase precursor, mitochondrial | Water-melon 
RA3079 | D39236 | 115 | 94.6 | 352 | 652) | Methionine adenosyltransferase (EC 2.5.1.6) | Tomato 
CK1912 | D15997 | 126 | O7ai | 469 | 469 | Monoubiquitin-tail protein 2 | Barley 
CK0153 | D28180 | 126 | 83.3 | 231 | 246 | Nuclear antigen 21D7 | Carrot 

RA2290 | D24637 | 127 | 82.9 | 493 | 503 | Nucleoside diphosphate kinase I | Spinach 
CK3114 | D23625 | 65 | 92.2 | 269 | 269 | Oleosin KD 18 | Maize 
RA0634 | D23944 | 120 | 100 | 692 | 692 | Oryzain © precursor | Rice 
CK0180 | D16134 | 135 | 87 | 483 | 492 | Peptidylprolyl isomerase | Maize 
RA1733 | D24323 | 96 | 99 | 424 | 424 | Phenylalanine ammonia-lyase | Rice 
RA3349 | D25146 | 106 | 92.5 | 524 | 524 | Phospho-2-dehydro-3-deoxyheptonate aldolase (EC 4.1.2.15) | Tomato 
CK3137 | D23639 | 111 | 94.5 | 484 | 484 | Phosphoglycerate kinase, cytosolic | Wheat 
CK1074 | D15678 | 217 | 90.1 | 385 | 385 | Phospholipid transfer protein homologue | Rice 
RA2304 | D24644 | 115 | 94.8 | 607 | 607 | Phosphoprotein phosphatase (EC 3.1.3.16) type 2A | Alfalfa 
RA0067 | D38976 | 120 | 100 | os) | B15 | Polyubiquitin protein | Arabidopsis thaliana 
CK2588 | D16056 | 136 | 100 | 634 | 634 | Proliferating cell nuclear antigen | Rice 
CK0536 | D15369 | 128 | Oley | 218 | 218 | Pyruvate decarboxylase | Maize 
CK1296 | D38834 | 140 | 91.8 | 563 | 564 | Pyruvate kinase, cytosolic | Potato 
CK2680 | D23355 | 87 | 86.2 | 258 | 260 | Rab25 protein | Rice 
RA0665 | D23963 | ay | 86 | 394 | 398 | Rho1Ps = ras-related small GTP-binding protein | Garden pea 
CK0385 | D15270 | 108 | 96.4 | 411 | 411 | Ribosomal 5S RNA-binding protein | Rice 
CK2415 | D23198 | 109 | 98.5 | 305 | 305 | Ribosomal protein L2 | Tomato (cv. Moneymaker) 
CK2471 | D38915 | 113 | 82.4 | 293 | 293 | Ribosomal protein L23a | Rat 
CK2111 | D16034 | 135 | 100 | 660 | 660 | Ribosomal protein L3 | Rice 
CK2450 | D23222 | 100 | 83.6 | 321 | 335 | Ribosomal protein L36a.e | Yeast (Schwanniomyces occidentalis) 
CK0634 | D22628 | 115 | 92.4 | 347 | 354 | Ribosomal protein L37a | Turnip 
CK2214 | D23111 | 134 | 82.6 | 310 | 315 | Ribosomal protein L38 | Tomato (cv. Moneymaker) 
CK2322 | D16041 | 131 | O25 | 469 | 501 | Ribosomal protein L7a | Rice 
CK2268 | D16039 | 95 | 89.6 | 352 | 352 | Ribosomal protein S11 | Maize 
CK1804 | D22928 | 96 | 86 | 208 | 208 | Ribosomal protein S13 | Maize 
RA1809 | D24377 | 133 | 100 | 488 | 488 | Ribosomal protein S14 (clone MCH2) | Maize 
CK1575 | D22842 | 125 | O12. | 332 | 332 | Ribosomal protein S16 | Large-leaved lupine 
CK1930 | D22980 | 151 | 89.1 | 271 | 284 | Ribosomal protein S20 | Rice 
CK1135 | D15714 | 100 | 100 | 247 | 247 | Ribosomal protein S21 | Rice 
RA0038 | D38974 | 138 | 85 | 437 | 437 | Ribosomal protein S25 | Tomato 
CK2303 | D23139 | 152 | 81 | 355 | 357 | Ribosomal protein S27 | Rat 
CK2977 | D23541 | 66 | 100 | 207 | 207 | Ribosomal protein S4 | Potato 
RA0528 | D23895 | 125 | 88.9 | 205 | 210 | Ribosomal protein S5 | Rat 
CK1004 | D22713 | 101 | 90.5 | 218 | 218 | RL5 ribosomal protein | Alfalfa 
CK0904 | D28208 | 134 | 98.1 | 506 | 506 | SalT protein precursor | Rice 
CK2540 | D23280 | 109 | 82.4 | 509 | 511 | Starch phosphorylase | Potato 
RA1801 | D24369 | 129 | 100 | 477 | 477 | Sucrose synthase | Rice 



RA1410 | D24133 | 130 | 100 | 721 | 721 | Sucrose-phosphate synthase | Rice 
RA3035 | D39209 | 140 | 100 | 401 | 401 | Superoxide dismutase (Cu-Zn) (clone RSODA) | Rice 
RA1625 | D24279 | 119 | pile | 479 | 483 | T complex polypeptide 1 | Oat 
RA0821 | D24003 | 104 | 86.9 | 375 | 377 | Tat-binding protein-1 | Human 
RA0260 | D38993 | 110 | 81.5 | 331 | 331 | Tonoplast intrinsic protein gamma | Arabidopsis thaliana 
RA2017 | D24474 | 87 | 89.7 | 427 | 427 | Trg-31 protein | Garden pea 
CK2209 | D23110 | 110 | 100 | 486 | 486 | Triose-phosphate isomerase (EC 5.3.1.1) | Rice 
CK2505 | D23254 | ils) | 100 | 582 | 582 | Tubulin α-1 chain | Rice 
RA1623 | D24277 | 119 | Bobs | 569 | 575 | Tubulin β-6 chain | Arabidopsis thaliana 
CK0176 | D22531 | 76 | 100 | 204 | 204 | Ubiquitin | Rice 
RA0552 | D23906 | 120 | 100 | 346 | 346 | Ubiquitin extension protein | White lupin 
RA2710 | D24887 | 115 | 93 | 516 | 521 | UTP-glucose-1-phosphate uridylyltransferase | Potato 
CK1761 | D15975 | 119 | 84 | 496 | 496 | Valosin-containing protein | Mouse 
CK2490 | D38921 | 135 | 87.3 | 303 | 308 | Voltage dependent anion channel, VDAC | Wheat 
RA0480 | D23874 | 131 | 100 | 515 | 515 | Ypt family | Maize 


CK and RA are the cDNA libraries from growth-phase callus in the presence of 2,4-D, and young root, respectively. The 
similarities of clones to the PIR database were examined using the FASTA algorithm. The proteins that showed optimized 
scores greater than 200 were putatively assigned to the cDNA clones. 

Fig. 34.3 Strategy for cDNA analysis. Poly(A)+ RNA was 
isolated from each tissue and cDNAs were synthesized. 
Partial DNA sequences of randomly selected cDNA 
clones were determined and sequence similarities 
analysed. The details of each step are described in the 
text. 


amylase when compared with other calli. It was not 
surprising that heat-shock proteins were prominent 
in the cDNA library derived from heat-treated 
callus. Some clones, such as the cDNAs for ubiquitin 
and elongation factor, were isolated to some extent 
from all libraries. 

These results clearly indicate that the composition 
of the clones in each cDNA library reflects the 
regulation of gene expression related to differenti- 
ation, growth conditions or environmental stress. 
Since each library includes a large number of 
uncharacterized cDNA clones, further investigation 
of the clones obtained from these cDNA libraries 
might provide new insights into the genes or proteins 
that play important roles in gene regulation in rice. 
Among the clones analysed, the percentage of 
unique clones was about 50% (Fig. 34.4). This indi- 
cates that many unique clones of expressed rice 
genes can be effectively isolated by large-scale 
cDNA analysis. 

Fig. 34.4 Relationship between redundancy and the 
number of analysed cDNA clones. Each dot shows the 
redundancy of analysed cDNA clones and the number of 
clones at the times indicated. 

The redundancy analyses within each library 
were also performed using the BLAST algorithm 
[19]. After BLAST analysis, a clone was regarded as 
redundant if it included a 50-bp region that showed 
more than 90% homology, or a 30-bp region that 
showed 100% homology, with other clones. Although 
the redundancy increased with the number of cDNA 
clones analysed, the rate of increase seemed to 
decline once more than 15000 clones had been analysed 
(Fig. 34.4). This result is considered to be due to the 
properties of the cDNA libraries that were analysed: 
the ratio of unique clones in the more recently analysed 
libraries might be higher than in previously analysed libraries. 
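
The redundancy rule can be written as a small predicate. The sketch below assumes that each clone's BLAST matches against other clones have already been reduced to (aligned length in bp, percent identity) pairs; that input format and the function names are hypothetical, and "region" is interpreted here as an alignment of at least the stated length.

```python
def is_redundant(alignments):
    """Apply the two thresholds described in the text.

    alignments: iterable of (aligned_length_bp, percent_identity) pairs
    for matches between one clone and the other clones.
    """
    for length, identity in alignments:
        if length >= 50 and identity > 90.0:
            return True
        if length >= 30 and identity == 100.0:
            return True
    return False


def redundancy_percent(clone_alignments):
    """Percentage of clones flagged as redundant.

    clone_alignments: dict mapping clone name -> list of alignments.
    """
    flags = [is_redundant(a) for a in clone_alignments.values()]
    return 100.0 * sum(flags) / len(flags)
```

For example, a clone with a single 40-bp match at 95% identity meets neither threshold and is counted as unique.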

Many of the partially characterized cDNA clones 
are also effectively used as ESTs on our RFLP linkage 
map and for the construction of a physical map 
using YAC contigs (see Section 34.4). 


34.2.3 Toward the complete catalogue 
of all rice genes 


Large-scale cDNA analysis is a very useful method 
for the investigation of expressed genes in rice, and 
provides good probes for the construction of an 
RFLP or a physical map. In the RGP, we aim to cata- 
logue all the cDNAs of the expressed genes, includ- 
ing tissue-specific, developmental stage-specific and 
stress-specific cDNA clones and the cDNAs from 
various tissues, including calli grown in different 
conditions. We have partially sequenced over 20000 
cDNA clones and about 10000 clones (50%) were 
shown to be unique, as mentioned above. This 
means that we have already captured about one- 
third of the total expressed genes in rice. 
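
The "one-third" estimate follows from the figures given in this chapter (about 20000 clones sequenced, about 50% unique, roughly 30000 expressed genes). A minimal sketch of the arithmetic, with a hypothetical function name, and assuming that each unique clone corresponds to a distinct gene:

```python
def captured_fraction(clones_sequenced, unique_ratio, estimated_genes):
    """Return (number of unique clones, fraction of expressed genes captured)."""
    unique_clones = clones_sequenced * unique_ratio
    return unique_clones, unique_clones / estimated_genes


unique, fraction = captured_fraction(20000, 0.5, 30000)
print(f"{unique:.0f} unique clones, about {fraction:.0%} of expressed genes")
```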

All of the partially sequenced cDNA clones were 
analysed for similarities to PIR database entries. 
Some genes were found to be expressed differentially 
in specific tissues. These data will be useful in 
investigating the potential functions of such unknown 
proteins in plants and in widening our knowledge 
of protein families. 

Our current method of cDNA analysis continues 
to be effective in identifying novel unique clones, 
although the ratio of unique clones has decreased 
with the increase in the number of clones analysed 
(around 50% at present). Furthermore, to identify 
other tissue-specific genes, cDNA analysis of clones 
from several stages of panicle development is now 
in progress. 

As a new approach, we have been developing a 
high-density cDNA filter hybridization system 
which carries 768 (96 x 8 dot array) colonies per filter. 
By using these high-density filters, which include all 
isolated cDNA clones, we are now planning to 
screen all our cDNAs against each other. Simultan- 
eous screening of all the isolated cDNAs in the RGP 
(about 20000 clones) will become possible by this 
system. Such screening will rapidly give a lot of 
useful genetic information, such as redundancy or 
tissue specificity of each cDNA clone. Furthermore, 
we also plan to construct a similar system using YAC 
or cosmid clones as probes. These systems are also 
expected to have large advantages in map-based 
cloning (Section 34.5). 
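
To give a feel for the scale of such a filter set, the sketch below computes how many 768-colony filters are needed for about 20000 clones and maps a clone index to a filter position. The orientation of the 96 × 8 array (8 rows of 96 dots) and the zero-based numbering are assumptions made for illustration only.

```python
import math

DOTS_PER_ROW = 96
ROWS_PER_FILTER = 8
COLONIES_PER_FILTER = DOTS_PER_ROW * ROWS_PER_FILTER  # 768, as in the text


def filters_needed(n_clones):
    """Smallest number of filters that holds n_clones colonies."""
    return math.ceil(n_clones / COLONIES_PER_FILTER)


def filter_position(clone_index):
    """Map a zero-based clone index to (filter, row, column), all zero-based."""
    filter_no, within = divmod(clone_index, COLONIES_PER_FILTER)
    row, column = divmod(within, DOTS_PER_ROW)
    return filter_no, row, column
```

With these assumptions, `filters_needed(20000)` gives 27 filters.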


34.3 PCR techniques for 
rice genome mapping 


34.3.1 RAPD analysis 


In the RGP, polymerase chain reaction (PCR) tech- 
niques are used extensively for detection of DNA 
polymorphism useful for linkage map construction 
[2,20,21], for screening YAC and cosmid clones, and 
for labelling probe DNAs, etc. For linkage analysis, 
we use both RFLP and PCR polymorphism techni- 
ques (see Chapter 5), although mainly we use 
sequenced cDNA clones, often characterized by 
similarity search. Using this strategy, we obtain map 
positions of genes with known functions, as indi- 
cated by similarity search. During the construction 
of our linkage map, however, some regions could 
not be mapped with cDNA clones. It seems that 
expressed genes are not equally distributed through- 
out rice chromosomes. For example, in the upper 
part of chromosome 9 in our map (Fig. 34.5), there 
was a largish region which could not be mapped 
with cDNA clones. 

The random amplified polymorphic DNA 
(RAPD) method [21-23] (see Chapter 5, Section 
5.4.2) can detect polymorphisms not only in coding 
regions, but also in noncoding regions, because 
RAPDs are amplified with genomic DNA templates. 

Fig. 34.5 Linkage map of PCR markers. RAPD markers are shown with locus names. Markers without names are RFLP markers. 


For this reason we also used RAPD for linkage 
analysis. RAPD has also been used in other plants, 
such as Arabidopsis [24], soybean [25], black mustard 
[26], sugarcane [27] and alfalfa [28]. 

The RAPD method also has the following advant- 
ages. 

1 It needs little space and simple equipment: only a 
PCR machine, an electrophoresis system and a sys- 
tem for detection of amplified DNA. 

2 It needs a very small amount (nanograms for one 
reaction) of template DNA, much less than hybri- 
dization methods. This reduces the amount of 
sample plants needed, time for template DNA 
isolation, and the cost of handling materials. 

3 It detects many bands per lane in gels—that is, 
many loci can be analysed at one time. 

Normally, we use primers 20-25 nucleotides long 
in pairs for a simple PCR. Because these primers 
hybridize to a single locus of the genome, we can 
obtain specific DNA fragments from each locus. 
However, the best primer length is 9-10 nucleotides 
for the RAPD reaction. There are about 4^10 (roughly 
10^6) times more matching sites for a 10-nucleotide 
primer in the rice genome than for a 20-nucleotide 
primer. If we use such short primers for PCR 
amplification, many fragments are amplified at 
once. If mutations at primer annealing sites or large 
insertions/deletions between primer annealing sites 
exist in the genome, these loci cannot be amplified 
by PCR. If small insertions/deletions between 
annealing sites exist in the genome, the amplified 
fragments will show fragment length polymorphism. 
In this way we can detect DNA polymorphisms 
between the genomes of different rice cultivars as the 
presence, absence or size difference of amplified bands. 
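
The effect of primer length can be illustrated with a rough expectation: in a genome of G base pairs with equal and independent base frequencies, a given k-nucleotide primer is expected to match about G/4^k sites per strand. The sketch below is a back-of-envelope model only; the uniform-base assumption and the genome size used in the example (4.3 × 10^8 bp) are assumptions not taken from this chapter.

```python
def expected_sites(genome_bp, primer_len, both_strands=True):
    """Expected number of perfect-match sites for one primer.

    Assumes all four bases are equally frequent and independent,
    which real genomes only approximate.
    """
    sites = genome_bp / 4 ** primer_len
    return 2 * sites if both_strands else sites


ASSUMED_GENOME_BP = 4.3e8  # illustrative value, not from the text

ten_mer = expected_sites(ASSUMED_GENOME_BP, 10)
twenty_mer = expected_sites(ASSUMED_GENOME_BP, 20)
print(f"10-mer: {ten_mer:.0f} sites; 20-mer: {twenty_mer:.1e} sites")
print(f"ratio: {ten_mer / twenty_mer:.0f}")  # equals 4**10
```

Under this model a 10-mer finds hundreds of sites while a 20-mer is expected to be unique, which is why short primers amplify many fragments at once.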

RAPD reactions are conventionally carried out 
with a single short primer, but in our laboratory, we 
routinely use pairs of 10-nucleotide primers. If we 
used a single primer for each RAPD reaction, 5000 
different reactions would need 5000 different pri- 
mers. However, if we use primer pair combinations, 
we only need 100 different primers for the same 
number of reactions (100 x 100/2). There is also the 
additional advantage of detecting more amplified 
fragments. When we use primer A and primer B in 
one reaction, the amplified fragments can have three 
possible combinations of fragment ends: 

1 both ends are primer A; 
2 both ends are primer B; or 
3 one end is primer A and the other end is primer B. 

We can thus detect additional DNA fragments 
that have different primer sequences at opposite 
ends of the amplified fragment. 
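
The saving from pairing primers, and the three classes of fragment ends, can be enumerated directly. In this sketch the exact number of distinct pairs of 100 primers is 4950, which the text rounds to about 5000 (100 × 100/2); the primer labels are placeholders.

```python
from itertools import combinations_with_replacement
from math import comb


def distinct_primer_pairs(n_primers):
    """Number of reactions that use two different primers."""
    return comb(n_primers, 2)


def fragment_end_classes(primer_a, primer_b):
    """Possible end combinations of fragments from a two-primer reaction."""
    return [f"{x}-{y}" for x, y in
            combinations_with_replacement((primer_a, primer_b), 2)]


print(distinct_primer_pairs(100))
print(fragment_end_classes("A", "B"))
```

Only the mixed class (one end from each primer) is new relative to the two single-primer reactions, which is the source of the additional fragments mentioned above.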

For linkage analysis, polymorphisms are screened 
first between the parent plants. Using primer sets 
that showed polymorphism, the genotype of each 
F2 plant is determined, and linkage between markers 
is then calculated from the genotypes with the linkage 
analysis software MAPMAKER [29]. Figure 34.5 
shows the RAPD markers mapped on the RFLP 
linkage map of Kurata et al. [2]. Many of our RAPD 
markers have been useful in that they are located in 
areas with only few RFLP markers. 


34.3.2 Determination of sequence-tagged sites 
from RAPD 


Gel profiles of amplified fragments containing 
RAPD are sometimes complicated. When nonpoly- 
morphic bands are observed near the RAPD band, it 
is difficult to determine the genotype from that 
RAPD band and to screen YAC and cosmid clones 
using RAPD fragments as probes. However, if 
RAPD fragments are cloned and sequenced, marker 
regions can be detected directly as sequence-tagged 
sites (STSs), which are also called sequence charac- 
terized amplified regions (SCARs) [30], by design- 
ing 20-mer primer pairs which amplify the target 
locus as a single band. We routinely clone RAPD 
fragments and design suitable STS primer pairs. If 
the amplified fragment with the STS primer pair 
[21,20] is used as a hybridization probe for RFLP 
analysis, it can detect codominant segregation for 
the linkage analysis. Moreover, because these STS 
primer pairs can amplify a single locus, these primer 
pairs are useful to isolate YAC and cosmid clones for 
making contigs. 


34.3.3 Bulked segregant analysis 
using the RAPD method 


The RAPD method can also be used to obtain 
markers that are closely linked to a specific target 
site. This method is called bulked segregant analysis 
[31,32]. Using this method, we were able to fill the 
gaps in linkage maps of chromosomes 7 and 9. 
Bulked segregant analysis is an efficient method, not 
only for constructing many markers around a target 
locus, but also for tagging phenotypically important 
loci with DNA markers. For gene tagging, we make 
two groups of F2 segregants according to the target 
phenotype. Genes for the phenotypes A and a are A 
and a, respectively. Group A has segregants whose 
phenotypes are A (genotypes are AA and Aa). Group 
a has segregants whose phenotypes are a (genotype 
is aa). Normally, we choose 10-15 segregants for 
making groups. 

To make mixed templates for groups A and a, 
either the leaves of the segregants in each group are 
pooled before DNA extraction, or the template DNAs 
extracted from the individual plants are pooled. The 
mixed template DNA of group A contains chromosome 
segments carrying both allele A and allele a, whereas 
the mixed template DNA of group a contains only 
segments carrying allele a. At loci unlinked to the 
target gene, the two mixed templates have essentially 
the same composition. This means that polymorphisms 
are detected specifically around the target gene on 
the chromosome. When 10 segregants are mixed, the 
length of the segments which carry the A or a allele 
is about 10 cM. These mixed (bulked) templates are 
then screened for RAPDs with many primers (or primer 
pairs). When a RAPD band is detected in group A but 
not in group a, the locus of the RAPD should 
be within 10 cM of the gene A. 
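
The screening step amounts to comparing the band patterns of the two bulks primer by primer. The sketch below assumes that bands have already been scored as fragment sizes in base pairs; the data structure, primer names and sizes are hypothetical.

```python
def candidate_linked_markers(bulk_A_bands, bulk_a_bands):
    """Find bands present in the group A bulk but absent from the group a bulk.

    Each argument maps a primer (or primer-pair) name to the set of band
    sizes in bp scored for that bulk. Returns primer -> sorted band sizes.
    """
    candidates = {}
    for primer, bands in bulk_A_bands.items():
        only_in_A = bands - bulk_a_bands.get(primer, set())
        if only_in_A:
            candidates[primer] = sorted(only_in_A)
    return candidates


# Hypothetical scores for two primer pairs.
group_A = {"P01/P02": {450, 800, 1200}, "P03/P04": {300, 950}}
group_a = {"P01/P02": {450, 1200}, "P03/P04": {300, 950}}
print(candidate_linked_markers(group_A, group_a))
```

Candidate bands found this way would still need to be confirmed on individual segregants before being placed on the map.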

Using near-isogenic lines [33], as well as bulked F2 
segregants, this method can identify linked markers 
to important genes. For filling gaps and low-density 
marker regions in the RFLP map, one or two RFLP 
markers in the ends of these regions are chosen and 
groups of segregant DNA with genotypes of the two 
RFLPs are made. Using these templates, we can 
obtain RAPD markers in the target region. With this 
technique and 80 primers (by combination) we have 
set 18 markers in three regions in our RFLP map. 


34.3.4 Single-strand conformation 
polymorphism analysis 


It is difficult to detect polymorphisms between 
closely related rice cultivars, because there are few 
nucleotide differences between them. In such a case, 
more effective methods than RFLP and RAPD are 
necessary. Single-strand conformation polymor- 
phism (SSCP) analysis [34,35] is a simple and 
sensitive PCR-based method for the detection of 
polymorphisms (see Chapter 5, Protocol 7). Single- 
stranded DNA in a nondenaturing polyacrylamide 
gel adopts a specific conformation depending on its 
nucleotide sequence. This affects its running in the 
gel and even one nucleotide difference between 
DNA fragments can be detected by the SSCP 
method. We mapped 17 SSCP markers in our 
RFLP linkage map [36]. Heat-denatured fragments 
amplified by PCR were electrophoresed in non- 
denaturing polyacrylamide gel. Separated frag- 
ments were visualized by silver staining. In our 
experiments it was difficult to detect SSCP with 
fragments over 300 bp. Because large fragments 
move slower than short fragments, they are closely 
spaced in the gel, so that we cannot detect the 
difference in migration between the fragments. 

The SSCP method also is more costly in time and 
money compared with the RAPD method, because 
the SSCP method requires pairs of specific 20-mer 
primers, while the RAPD method requires randomly 
chosen 10-mer primers. However, the SSCP 
method is still effective for the detection of poly- 
morphisms between genetically close lines, for 
which it is difficult to obtain polymorphic markers 
by the RFLP method. 


34.3.5 Future of PCR analysis in 
plant genome research 


We have already mapped about 2300 RFLP, RAPD 
and SSCP markers in the rice linkage map; of these 
about 7% are PCR markers. The advantage of PCR 
markers is that YAC clones can be screened directly. 
This enables us to link genetic loci to the physical 
map. In high-density marker regions, there are 
already enough markers to make YAC contig screen- 
ing possible. For low-density marker regions the 
next step for producing markers will be screening 
markers in targeted loci by bulked segregant 
analysis. To accumulate more new markers, we need 
other methods for detection of polymorphism. 
Denaturing gradient gel electrophoresis (DGGE) 
[37] is a highly sensitive method of detecting 
polymorphism, which can detect even one point 
mutation. In plants, DGGE has been used to detect 
PCR fragments linked to the S-locus in rye [38]. 
However, the DGGE method costs more and takes 
longer than the RAPD method. The RAPD method 
has also been applied to detect differences of gene 
expression [39]. In the RGP, we have used cDNAs 
reverse transcribed from rice mRNAs from different 
tissues and growth stages. The transcribed genes 
were then amplified by RAPD using primers 10 
bases long. Differences of expression level can be 
correlated with the amounts of amplified fragments. 
This method is very useful for differential screening 
of expressed genes. 

In the RGP, much nucleotide sequence data on 
expressed rice genes has been accumulated in the 
past three years. From our sequence data, we have 
calculated decanucleotide frequencies and plan to 
adapt our construction of RAPD primers according- 
ly, so as to detect polymorphisms more easily and 
reliably. Moreover, using the decanucleotide fre- 
quency data, we can design primers flanked with di- 
or trinucleotide sequence repeats. These primers can 
detect fragment length polymorphism. These PCR- 
based techniques for detection of polymorphisms 
are simpler and more convenient than hybridization 
methods. Such methods could be adapted to detect 
polymorphisms in other plants. 
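
The decanucleotide frequency calculation mentioned above can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the toy sequences are hypothetical, and the way the RGP actually turns the frequency data into primer choices is not described here.

```python
from collections import Counter

def kmer_frequencies(sequences, k=10):
    """Count every overlapping k-mer (default: decanucleotides) in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows that contain ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def most_common_kmers(sequences, k=10, n=5):
    """Return the n most frequent k-mers as (k-mer, count) pairs."""
    return kmer_frequencies(sequences, k).most_common(n)

# Toy sequences for illustration only (not real rice cDNA data).
toy_sequences = ["ATGGCGGCGGCGGCGTTAACC", "CCGGCGGCGGCGTTA"]
print(most_common_kmers(toy_sequences, k=10, n=3))
```

Frequent decamers found in this way would be candidate RAPD primer sequences; counting with k set to 2 or 3 gives a first impression of di- or trinucleotide repeat content in the same data.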


34.4 Genetic linkage map of rice and 
its applications 


Construction of a high-density linkage map with 
DNA markers is an important basis for genome 
analysis of rice as well as of other plant species. Here 
we describe the current status of a high-density 
linkage map of rice developed by the RGP using 
DNA markers [2]. General strategies and methods 
for linkage map construction and their application 
to gene isolation of rice will also be described. 


34.4.1 Mapping populations and DNA probes 


The initial choice of the segregating population is
very important for linkage mapping of DNA
markers. Several types of progenies, such as F2, BC1
(first back-cross generation) and recombinant inbred
lines (RILs), can be used. Each of these populations
has its own advantages and disadvantages. F2
and BC1 populations can be easily obtained from a 
single cross of parental lines within a few years. 
However, leaf material for DNA extraction is 
limited. In this respect, RILs are very useful for 
linkage mapping with DNA markers. For example, 
as most chromosomal regions of these inbred lines 
are homozygous for one parental allele, self- 
pollinated seeds of these lines show no genotypic or 
phenotypic segregation, and so can be distributed as 
seeds to many researchers to share a mapping 
population. However, the main disadvantage of 
recombinant inbred lines is the time required for 
their construction. They are constructed by the 
single seed descent method from F2 individuals. It
usually takes six or seven generations to establish
inbred lines. Doubled haploid lines (DHLs) are 
another source of mapping populations. In rice, 
DHLs can be constructed by anther culture of F1 
plants. 
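
The number of generations quoted above can be related to residual heterozygosity by a short calculation. The sketch below assumes the textbook expectation for strict self-pollination, in which the heterozygous fraction at a locus halves in every generation; this assumption and the function name are added for illustration and are not RGP results.

```python
def expected_heterozygosity(generation):
    """Expected heterozygous fraction at a locus in the F<generation> of a selfed cross.

    Assumes strict self-pollination from a fully heterozygous F1 and no selection.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or later")
    return 0.5 ** (generation - 1)

# Six or seven selfing generations after the F2 give F8 or F9 lines.
for gen in (2, 8, 9):
    print(f"F{gen}: {expected_heterozygosity(gen):.2%} heterozygous")
```

Under these assumptions fewer than 1% of loci are still expected to be heterozygous by the F8, which is why such lines can be propagated and shared as seed.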

In general, wide crosses increase the probability 
of detecting polymorphism. In some plant species, 
crosses between a cultivar and a wild relative have 
been employed for linkage mapping. In rice, 
progenies derived from both intraspecific crosses 
[2,40-42] and interspecific crosses [3,43] have been 
used as mapping populations. 

An efficient supply of single-copy or low-copy 
sequences is one requirement for linkage mapping 
of DNA markers. A genomic library is one of the 
sources for such DNA probes. In rice, many 
randomly selected genomic DNA clones, derived 
from a PstI genomic library, were used as probes in
the first linkage maps constructed with DNA 
markers [40,41]. cDNA libraries are another effective 
probe source for single-copy sequences in rice [2,3]
and other plant species. Because of homology
between related plant groups, DNA clones derived 
from other crop species, such as wheat [4], maize 
and oats [43], are also good probes for linkage 
mapping in rice. 


34.4.2 Constructing the rice linkage map 


The general procedure for linkage map construction 
in rice is shown in Fig. 34.6. To construct the RGP 
linkage map, we have used 186 F2 plants derived 
from an intraspecific cross between a japonica vari- 
ety, ‘Nipponbare’, and an indica variety, ‘Kasalath’. 
The total DNA of each individual has been extracted 
from frozen leaf tissue by the CTAB method [44]. In 
order to overcome a shortage of total DNA of F2 
individuals for mapping analysis, bulked F3 seed- 
lings derived by selfing each F2 individual have 
been used to restore F2 DNA. In our mapping
strategy, we have used mainly cDNA clones as
probes. As of May 1996, we have screened two 
parental lines with more than 5000 DNA clones, 
mainly cDNA clones from callus and root. After 
extensive RFLP screening, informative clones have 
been used to score genotypes of 186 F2 individuals 
by Southern hybridization analysis. 

The first high-density and high-resolution rice
genetic map was developed by the RGP over three years
[2]. In total, 1383 DNA markers have been mapped
on the linkage map. The number of markers and
their categories on the 12 chromosomes are shown in
Table 34.3. The DNA markers are distributed along
1575 cM on 12 linkage groups and consist of 883
cDNAs, 265 genomic DNAs, 147 RAPDs, and 88
DNAs from other sources. cDNA markers were
derived from rice callus and root libraries. Genomic
markers were mainly randomly selected DNA
clones, with some NotI linking, YAC end and
telomere-associated clones (Table 34.3). RAPD
markers were used for mapping to find out whether
there is distribution bias for this type of marker and
whether they can be used to fill marker-rare regions
on the linkage map. A bulked segregant analysis [31]
was used to develop markers in marker-rare regions
(see Section 34.3). All the cDNA fragments and most
of the genomic fragments were partially sequenced
to convert them into STS (sequence-tagged site) or
EST (expressed sequence tag) markers (see Section
34.2). The mapped markers have been characterized
in detail by Kurata et al. [2]. As of May 1996, we have
mapped about 2300 DNA markers on the linkage
map. Green shoot and etiolated shoot cDNA libraries
have been used as probe sources.

Table 34.3 No. of markers and their categories in the high-density linkage map of rice [2].

                                                Chromosome
                        1     2     3     4     5     6     7     8     9    10    11    12 Total
cDNA from callus       67    51    59    38    45    41    41    25    25    28    28    19   467
cDNA from root         63    48    61    32    34    45    38    24    18    15    17    21   416
Random genomic         15    19    12    12     9    13     6     8    11     9    12    12   138
NotI linking           18     7     7     8     5    11     6     6     4     5     8     5    90
YAC end                 4     3     3     2     4     6     3     0     1     2     3     2    33
Telomere                0     0     0     0     1     0     0     0     0     0     2     1     4
RAPD                   10     5     3     8    11    25     7    10     8     5    13     5   110
STS                     4     1     2     3     2     3     4     2     4     3     8     1    37
Wheat                   9     4     7     4     7     7     4     4     5     0     3     4    58
Others                  3     3     4     7     0     5     0     0     0     1     3     4    30
Total                 193   141   158   114   118   156   109    79    76    68    97    74  1383
Total length (cM)   191.8 156.? 168.5 129.? 123.7 130.4 128.7 124.8 100.5  85.? 123.3     ?  1575

Analysis of molecular markers such as RFLPs is 
an effective way of revealing chromosomal rear- 
rangements, such as translocations, deletions and 
duplications. Kishimoto et al. [45] indicated, by 
linkage mapping of cDNA clones, that sequences 
between some regions of chromosomes 1 and 5 have 
been conserved. In the RGP linkage map, many
clones showed more than one band by genomic 
Southern hybridization and were mapped at dupli- 
cate or triplicate loci among the 12 chromosomes. 
Extensive conservation in linkage alignment of 13 
loci was observed between the lower distal regions 
of chromosomes 11 and 12 [46]. These conserved 
regions span 10cM and 11.8cM from the distal ends 
of the linkage maps of chromosomes 11 and 12, 
respectively. These results suggest that these con- 
served regions were generated by a duplication of 
chromosome segments. 

In addition, 58 wheat genomic DNA fragments
have been mapped on the high-density linkage map
in collaboration with the Cambridge Laboratory at
the John Innes Centre, UK. As a result, we have
established that wheat chromosome group 1 clearly
corresponds to rice chromosome 5, group 2 to rice
chromosomes 4 and 7, group 3 to chromosome 1,
group 4 to chromosome 3, group 6 to chromosome 2
and group 7 to chromosome 6. Markers on wheat
chromosome group 5 were mapped on rice chromosomes
1, 3, 9, 11 and 12. Surprisingly, in all chromosomes,
most of the markers analysed showed the
same linkage ordering between rice and wheat [4].
This preliminary synteny analysis of rice and wheat 
revealed high conservation for linkage ordering of 
DNA markers. Thus, the synteny map and the 
mapped rice probes will be very useful for the 
molecular analysis of the corresponding chromo- 
somal regions in wheat. Rice will be a very useful 
model plant for genome analysis for several cereals, 
such as wheat, barley, maize and rye. 

At present, four linkage maps, each based on
independent probe sources, have already been
constructed for rice [2,3,40,41]. Efforts to integrate
these maps are in progress in collaboration with 
Cornell University. We are also developing a 
consensus framework linkage map using RILs, 
which have been constructed at Kyushu University 
[47]. As many research groups can share those RILs, 
the consensus framework map would facilitate an 
effective integration of all the information about 
linkage maps. Once a fully integrated genetic map 
with DNA markers is established, all the DNA 
markers will be very powerful tools, not only for the 
analysis of rice genome structure and function, but 
also for practical rice breeding [48]. 


34.4.3 Gene tagging for map-based cloning 


It is often difficult to isolate genes conferring particular
traits on a plant because of the lack of knowledge
about gene functions and/or gene products.
Map-based cloning is one method of isolating such 
genes, and has been used successfully in the tomato 
to isolate the disease-resistance gene Pto and other
genes [49]. Map-based cloning has also been 
employed in the RGP to isolate genes of biological 
and agronomic value. The first step of a map-based 
cloning strategy is the high-resolution linkage 
mapping of target genes. We have already mapped 
several genes for morphological and physiological 
traits, such as Xa-1 (bacterial leaf blight resistance), 
Se-1 (photoperiod sensitivity), Gm-2 (gall midge 
resistance; in collaboration with Dr Mohan, ICGEB, 
India), eg (extra-glume), Ph (phenol staining), Rc
(brown pericarp and seed coat) and alk (alkali 
degeneration) on our high-density linkage map. 

Many rice geneticists and breeders have made 
linkage analyses using DNA markers. As a result, 
many DNA markers linked to useful genes have 
been identified. These include: genes for resistance 
to disease, such as rice blast [50,51] and bacterial leaf 
blight [52,53], genes for resistance to insects, such 
as brown plant hopper [54] and gall midge [55], 
a photoperiod sensitivity gene [56],
photoperiod-sensitive genic male sterility genes [57], semidwarf
genes [58-60], a scented kernel gene [61] and a gene for
accumulation of glucomannan in the endosperm cell
wall [62]. When we employ the map-based
cloning method to isolate those genes, the linkage 
map of the target region will need to be of quite high 
resolution to select a single YAC or cosmid clone 
carrying the target gene (see Section 34.5). 

To get high resolution, a large number of 
segregating individuals and a high degree of 
accuracy for genotyping of the target locus will be 
required. The segregating population derived from 
crosses between near-isogenic lines and their
recurrent parents is usually the best material. The F3
lines derived from F2 individuals should also be
investigated in order to achieve a high reliability 
of genotyping for a target locus. However, this is 
time and labour consuming. The pooled-sampling 
approach is an alternative method for constructing a 
high-resolution map of the target region [63]. Appli- 
cation of this strategy requires several conditions, 
such as availability of a large segregating population 
and early and accurate classification of homozygous 
individuals for recessive or dominant alleles. When 
these conditions are met, it is possible to clarify 
the relative order of closely spaced markers, 
including the target locus of interest. In the RGP, we 
have made some progress in high-resolution linkage 
mapping for map-based cloning. A high-resolution 
linkage map of the Xa-1 region, which is our first 
target gene to be isolated, has been constructed by a
combination of the standard and pooled sampling
methods.
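
A rough calculation shows why a large segregating population is needed for high resolution. The sketch below assumes that each gamete independently carries a crossover in the interval of interest with probability r, and that for short intervals r is approximately the map distance in cM divided by 100. These simplifications and the function names are assumptions made for illustration, not part of the RGP procedure.

```python
import math

def prob_at_least_one_recombinant(r, n_gametes):
    """Probability that at least one of n_gametes is recombinant in an interval
    with recombination fraction r (independent gametes assumed)."""
    return 1.0 - (1.0 - r) ** n_gametes

def gametes_needed(r, confidence=0.95):
    """Smallest number of gametes giving at least `confidence` probability
    of observing one or more recombinants in the interval."""
    if not 0.0 < r < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("r and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - r))

# Example: an interval of about 0.1 cM (r of roughly 0.001).
n = gametes_needed(0.001)
print(n, "gametes, or about", math.ceil(n / 2), "F2 plants (two gametes each)")
```

Under these assumptions, finding even a single recombinant within 0.1 cM with 95% probability calls for roughly 3000 informative gametes, which is the scale at which pooled sampling becomes attractive.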


34.4.4 Mapping quantitative trait loci 


In contrast to disease and insect resistance, many 
important traits in breeding, such as yield, culm 
length, heading date and eating quality, show 
continuous variation in progenies. Inheritance of
those traits is controlled by several genes. It is
difficult to identify those genes, known as quanti- 
tative trait loci (QTL), because the individual effects 
of the genes on phenotype are relatively small. 
Recent progress in isolating DNA markers and their 
linkage maps now enables us to analyse these 
individual QTLs [64]. The strategy for detecting 
QTLs using linked major genes was developed 
many years ago [65], but was difficult to put into 
practice using conventional genetic markers. So far, 
many QTLs have been clarified using DNA markers 
in various crop plants, such as tomato [66,67] and 
maize [68,69]. In rice, QTL analysis with DNA 
markers has been employed to detect genomic 
regions conferring cooked-kernel elongation [70] 
and partial resistance to blast disease [51]. In these 
studies, putative genomic regions determining such 
complex traits could be identified. 

From the beginning of the RGP, the feasibility of 
isolating genes at QTL with a map-based cloning 
system has been investigated. A large number of 
markers (857 loci) have been used to identify the 
QTLs affecting heading date, culm length, panicle 
length, etc. Many QTLs were detected, with a wide 
range of gene action on phenotypes, using the 
computer software MAPMAKER/QTL [71]. High- 
resolution QTL mapping also revealed evidence for 
the existence of multiple QTLs for the same trait in 
one chromosomal region and for specific gene 
interactions between identified individual QTLs, 
such as epistasis and suppression. Putative locations 
and DNA markers linked to QTLs will make 
marker-assisted selection feasible in rice breeding. 
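
The idea behind detecting a QTL with a linked DNA marker can be illustrated with a single-marker analysis. The sketch below is a deliberate simplification and is not the interval mapping performed by MAPMAKER/QTL: it compares the residual variation of a trait with and without grouping the plants by their genotype at one marker, and expresses the result as a LOD score, LOD = (n/2) log10(RSS0/RSS1). The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def single_marker_lod(genotypes, phenotypes):
    """LOD score for association between one marker and a quantitative trait.

    genotypes  : marker classes, one per plant (e.g. 'A', 'H', 'B' in an F2)
    phenotypes : trait values in the same order
    """
    if len(genotypes) != len(phenotypes) or not phenotypes:
        raise ValueError("genotypes and phenotypes must be non-empty and equal in length")
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Residual sum of squares with no marker effect (null model).
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)

    groups = defaultdict(list)
    for g, y in zip(genotypes, phenotypes):
        groups[g].append(y)
    # Residual sum of squares around each genotype-class mean (marker model).
    rss_marker = 0.0
    for values in groups.values():
        class_mean = sum(values) / len(values)
        rss_marker += sum((y - class_mean) ** 2 for y in values)

    if rss_marker == 0.0:
        return math.inf if rss_null > 0.0 else 0.0
    return (n / 2.0) * math.log10(rss_null / rss_marker)

# Toy data: heading date (days) for eight plants scored at one marker.
marker = ["A", "A", "H", "H", "H", "B", "B", "B"]
days = [95.0, 97.0, 101.0, 100.0, 102.0, 108.0, 107.0, 110.0]
print(round(single_marker_lod(marker, days), 2))
```

Repeating such a test for every mapped marker gives a crude genome scan; interval mapping refines this by also evaluating positions between flanking markers.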

In general, it was difficult to determine the precise
location and gene action of individual QTLs. The
low accuracy of QTL mapping is a serious obstacle to
map-based cloning. In tomato, overlapping
substitution lines have been used to map QTLs
precisely [66]. To overcome these problems, we are
constructing well-characterized genetic stocks, such
as near-isogenic lines, carrying one or multiple
chromosomal segments of the parental line, ‘Kasalath’, in
the genetic background of the other parental line,
‘Nipponbare’ (Fig. 34.7). In this way, we will be able
to combine desired chromosomal regions containing
QTLs, and to separate them from undesired regions, in selected
individual plants. By using these substitution lines
or near-isogenic lines for a given chromosomal
segment, it will be possible to handle a given QTL as
a single Mendelian factor. Thus we would be able to
determine the accurate location of a given QTL on
the linkage map, to clarify the precise gene effects and to
evaluate the genotype/environment interaction of
individual QTLs. Once we succeed in mapping the
genes of interest at high resolution, these genes will
be isolated by map-based cloning.

Fig. 34.7 Flow chart of the construction of near-isogenic
lines (NILs) for the fine mapping of quantitative trait
loci (QTLs).


34.4.5 Future prospects for linkage mapping


During the first three years of the RGP, a high-density
linkage map using DNA markers has been
constructed quickly. The linkage map and mapped
DNA markers have already been used for physical
map construction and for tagging genes of
agronomic and biological interest. The number of
mapped markers is already enough to embark on
research into these topics. In order to progress
further, it will be necessary to map DNA markers
effectively in targeted chromosomal regions. We
should also map cloned DNA fragments with
already known function.


A detailed conventional rice linkage map has been 
compiled [1]. It is composed of many morphological 
and physiological marker genes and agronomically 
important genes, such as those for disease and insect 
resistance. Integration of the linkage map with 
conventional markers and with DNA markers is in 
progress [72,73]. In the RGP, our final objective is to 
construct a comprehensive genetic map including 
genes for morphological and physiological traits, 
genes for quantitative traits, as well as DNA 
markers. This genetic map will contribute greatly to 
rice genetics and breeding as well as to knowledge 
of the basic biology of the rice plant. 


34.5 Making the physical map 


34.5.1 Current state of the art in 
physical mapping 


The cultivated rices, O. sativa and O. glaberrima, have 
12 chromosomes (2n=24), carrying about 430 
million base pairs of DNA. The size of the rice 
genome is about 10-fold that of yeast and one-tenth 
of the human genome. Among plants, Arabidopsis 
thaliana has a genome three times smaller than rice,
while rice has the smallest genome among cereal
crops; maize has a genome eight times larger, and 
that of wheat is 40 times larger (see Chapter 32). The 
relatively small size of the rice genome facilitates 
construction of a physical map. 

Two main results we want to obtain from a 
complete rice physical map are: 

1 information on the basic structure of the rice 
genome as a model for monocot plants; 

2 isolation of genes for agronomically or scientifi- 
cally interesting traits by map-based cloning. 

The first aim is based on the prospect that the 
establishment of a rice physical map, together with 
a high-density expression map, would be quite 
helpful not only for understanding the basis of 
monocot plant genome structure, but also for analy- 
sing genome evolution among various organisms. 
When utilizing a physical map, it is important to 
consider what kind of information one wants to 
extract from the map. From this point of view, the 
most useful information would be a map that has 
several hierarchical DNA contigs comprising long, 
medium sized and short DNA fragments, that is 
ordered YAC, cosmid and plasmid libraries overlap- 
ping each other. Furthermore, an expression genome 
map that has a complete array of genes (expressed 
sequences) on the ordered DNA fragments of the 
physical map should be the most informative com- 
prehensive map for resolving genome organization. 

The second aim in physical mapping is cloning of 
important genes that are often known only by 
phenotypic traits. Starting with the tagging of these 
trait genes by DNA markers located on the genetic 
linkage map, a detailed physical map of the target 
region will be needed for the next step. A large 
number of expressed sequences arrayed on the 
physical map, together with tagged DNA markers 
close to the target genes, make it possible to identify 
and clone the genes in a systematic manner. 

In addition to the detailed physical map, maps of 
the synteny with related plants may help in map- 
based gene cloning. Recently, synteny relationships 
between several cereal crops have been reported 
[43,74,75]. The synteny analysis between rice and 
wheat in particular showed a strikingly high 
colinearity of gene order in all chromosomes [4]. 
Further work on microsynteny in limited regions, 
which in other cereal crops carry genes determining 
important traits, may make it possible to isolate such 
genes in other cereals using the rice physical map. 

The cytogenetic map is also a kind of physical 
map. In the case of the human genome, such a map is 
being constructed by locating hundreds of DNA 
markers on chromosomes with fluorescence in situ 
hybridization (FISH) [76] (see Chapter 9). To visualize
the location of genes directly on chromosomes
or on isolated chromatin is an important way of 
generating a comprehensive genome map. In rice, 
several repetitive and single-copy sequences have 
been mapped on chromosomes by in situ hybri- 
dization [77-79]. However, detecting the exact 
location of single-copy sequences on the chromo- 
somes is still not feasible. One reason is the 
similarity in size and shape of the metacentric or 
submetacentric rice chromosomes and their small 
size. Effective discrimination of rice chromosome 
regions by, for example, banding patterns or other 
cytogenetic characteristics is not yet possible, except 
by the use of a sophisticated densitometry-based 
imaging analysis system [80]. 


34.5.2 Use of YAC, BAC and cosmid libraries 


The first step in the construction of a physical map is 
the preparation of genomic libraries. Several kinds 
of genomic libraries can be used for physical 
mapping, each having its own advantages and
disadvantages both for library construction and in 
their utility for physical mapping. The three types of 
libraries most commonly used for physical mapping 
in other organisms are the libraries constructed with 
yeast artificial chromosome (YAC), bacterial artifi- 
cial chromosome (BAC) and cosmid vectors. YACs 
can be used to clone very long genomic DNA 
fragments, from several hundred kilobases to over 
one megabase. The construction of a high-quality 
YAC library and its evaluation is, however, difficult 
and time consuming, because of the frequent 
occurrence of chimaeric clones. Such difficulties, 
however, can be overcome by combining several 
strategies in physical map construction, as discussed 
below. 

After preliminary trials, we have constructed 
satisfactory YAC libraries [81] using protoplast cells 
of ‘Nipponbare’. The two YAC libraries currently
used comprise 7000 clones with an average insert 
size of 350 kb genomic DNA. These libraries cover 
the rice genome around 5.5 times and should cover 
over 80% of total chromosome length when all 
clones are aligned along the chromosomes. About 
40% of the clones in the libraries are chimaeric. 
However, most of the chimaerism was observed in 
clones with inserts of over 400 kb, and was not 
frequent in the clones with smaller inserts. These 
libraries are now being successfully applied to
physical map construction [82]. 
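
The coverage figures above can be checked with a short calculation. The sketch below uses the rounded values given in this chapter (7000 clones, 350 kb average insert, a genome of about 430 Mb). The probability estimate uses the standard Clarke-Carbon expression, P = 1 - (1 - f)^N, which assumes randomly distributed, non-chimaeric clones; that formula and the function names are assumptions added here and were not necessarily used by the RGP.

```python
def fold_coverage(n_clones, insert_kb, genome_kb):
    """Total cloned DNA expressed as a multiple of the genome size."""
    return n_clones * insert_kb / genome_kb

def probability_covered(n_clones, insert_kb, genome_kb):
    """Clarke-Carbon estimate of the probability that a given sequence
    is present in at least one clone (random, non-chimaeric clones assumed)."""
    f = insert_kb / genome_kb  # fraction of the genome carried by one clone
    return 1.0 - (1.0 - f) ** n_clones

GENOME_KB = 430_000  # about 430 Mb
print(f"fold coverage: {fold_coverage(7000, 350, GENOME_KB):.1f}")
print(f"idealized probability of coverage: {probability_covered(7000, 350, GENOME_KB):.3f}")
```

With these rounded inputs the library represents between five and six genome equivalents, in line with the figure quoted above. The idealized probability is an upper bound: chimaeric clones and non-random cloning reduce the usable coverage, which is consistent with the more cautious expectation of over 80%.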

In most cases, it is quite difficult to cover the 
whole genome with only one kind of genomic clone 
library. Therefore, the RGP also decided to make 
cosmid libraries. Cosmid libraries will be very
useful not only to fill the gap regions of YAC contigs
but also to divide the long YAC clones into several 
cosmid clones for further analysis, both to isolate 
target genes and to do detailed physical mapping. 
Several cosmid libraries have already been con- 
structed [83], one with rice cultivar ‘Nipponbare’ 
DNA for physical mapping and others with different 
rice cultivars carrying interesting target genes for 
map-based cloning. A BAC library with inserts of 
about 150 kb DNA is also useful and easy to deal 
with for physical mapping. Wang et al. at the 
University of California at Davis have constructed a 
BAC library for map-based cloning of the rice 
bacterial blight resistance gene Xa-21 [51]. Physical 
mapping will also be made easier by using 
complementary information from several different 
genomic libraries. 


34.5.3 Map construction strategies and methods 


There are several strategies applicable to physical 
map construction. A complete physical map should 
be a map which includes all DNA of all 12 
chromosomes from one end to the other. For this 
purpose, it is best to obtain genomic DNA fragments 
as large as possible and order them along the 12 
chromosomes. Isolation of the 12 chromosomes 
independently would be the best way for making 
chromosome-specific genomic DNA libraries as the 
source of starting materials for physical mapping. 
Rice chromosomes, however, cannot be sorted by 
laser beam chromosome sorting (flow sorting; see 
Chapter 12), because of the very small size and 
continuous variation in length (maybe also in DNA 
content) among the 12 chromosomes. However, 
because of the small genome size of rice, it is possible 
to construct a genomic library of large DNA 
fragments and to order them directly on the 12 
chromosomes. In addition, the low proportion of 
repetitive DNA sequences (only about 50% of the 
rice genome) [84] makes it easier in rice to do 
cloning, selection and ordering of large DNA 
fragments to build up a physical map using a whole 
genomic library. 

How to order all of the YAC clones in the libraries 
along chromosomes largely depends on the genome 
structure of the organisms in question. For instance, 
the human genome, which has highly dispersed Alu 
family sequences, can be reconstituted with a large 
number of YAC clones walked by Alu-sequence- 
based PCR methods [85] (see Chapter 15). In 
addition, human chromosome-specific libraries are 
available, so that formation of cosmid contigs by 
determining cosmid overlaps through fluorescence- 
based digested fragment mapping is also possible. 


For rice, we have not yet found any special genome 
features that could be utilized for efficient physical 
mapping, except for the small genome size and the 
low proportion of repetitive sequences. Therefore it 
seemed to us best to use a high-density and high- 
resolution rice genetic map to order YAC clones 
corresponding to the mapped DNA markers on it, as 
indicated in Fig. 34.8. Because we have already
constructed a high-density 300-kb interval rice 
genetic map [2], it should be possible to cover the 
whole genome by selecting and ordering YAC clones 
with an average insert length of 350 kb. The practical 
methods for YAC contig formation used in RGP are 
as follows (see Fig. 34.9a, b). 

1 We make high-density colony filters carrying
4 × 4 × 96 YAC clones on each filter. Our YAC libraries
totalling 7000 clones can be spotted on five filters. 

2 We screen the YAC libraries by colony hybri- 
dization with all individual RFLP markers of 
already mapped genomic and cDNA clones on our 
high-density rice genetic map. 

3 DNAs of positive candidate YAC clones are 
further investigated by Southern hybridization 
analysis for detecting marker DNA sequences in 
them. The flow chart is presented in Fig. 34.9a. 

4 Where we use STS markers (derived mainly from 
RAPD markers), all 7000 cloned DNAs are divided 
into several pool combinations in 96-well microtitre 
plates for PCR screening with site-specific primer 
sequences (see Fig. 34.9b). 

5 After PCR screening of the first W pool, a second 
screening of X, Y and Z combination pools is 
performed to identify the position of the positive 
YAC clones. In Fig. 34.9b, as an example, PCR
products are amplified in X5, Y7 and Z4 pools, 
identifying the YAC clone that has the STS marker 
sequence. 

6 All the available data on YAC clones selected 
either by Southern hybridization or PCR screening is 
collected in the database for linking with other 
results useful in physical mapping. 
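
The arithmetic of steps 1 and 5 can be sketched as follows. The filter calculation uses the figures given above. The pooling part is only an illustration of the principle: the actual layout of the W, X, Y and Z pools is defined in Fig. 34.9b, so the 12 × 8 × 8 grid, the clone names and the function names used here are hypothetical.

```python
import math
from itertools import product

def filters_needed(n_clones, clones_per_filter=4 * 4 * 96):
    """Number of high-density filters required to spot n_clones."""
    return math.ceil(n_clones / clones_per_filter)

def assign_coordinates(clone_ids, nx, ny, nz):
    """Give each clone a 1-based (x, y, z) position on a hypothetical 3-D grid."""
    if len(clone_ids) > nx * ny * nz:
        raise ValueError("grid too small for the number of clones")
    grid = product(range(1, nz + 1), range(1, ny + 1), range(1, nx + 1))
    return {cid: (x, y, z) for cid, (z, y, x) in zip(clone_ids, grid)}

def decode_pools(coords, x_hits, y_hits, z_hits):
    """Return the clones lying at the intersection of the positive X, Y and Z pools."""
    return [cid for cid, (x, y, z) in coords.items()
            if x in x_hits and y in y_hits and z in z_hits]

# Hypothetical superpool of 768 clones arranged as 12 x 8 x 8.
coords = assign_coordinates([f"YAC{i:04d}" for i in range(768)], nx=12, ny=8, nz=8)
print(filters_needed(7000))                 # filters for the whole library
print(decode_pools(coords, {5}, {7}, {4}))  # the clone at X5, Y7, Z4
```

A single positive clone is identified unambiguously by one positive pool in each dimension. When a superpool contains two or more positive clones, several intersections become possible and the candidates have to be confirmed individually.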

By locating YAC clones on corresponding DNA
markers using these methods, the RGP physical map
so far (March 1996) covers about half of the
genome. With the aim of introducing more effective
methods, trials in the use of fingerprinting to order
YAC clones are also in progress. Isolation of high-copy
DNA sequences in the rice genome has been
carried out in microsatellite assays [78,86], genomic
fingerprinting [87] and analysis for short nucleotide
repeats. In the use of these repetitive DNAs for YAC
fingerprinting, their specificity for rice, but not for
yeast, is a very important factor. One of the
microsatellites and one short repetitive nucleotide
sequence have been found to produce specific
multiple banding in distinctive YACs. This should
enable great progress in making YAC contigs.

Fig. 34.8 Construction of the physical map by using
DNA markers on the high-resolution genetic map. All
DNA markers mapped on the high-density and
high-resolution genetic linkage map are very useful to select
and order YAC and/or cosmid clones. This should be the
most reliable way to decide exact YAC overlaps on each
chromosome for physical mapping.


34.5.4 Map-based cloning of target genes


Several examples of gene isolation through map-based
cloning have been published recently in
tomato [49], tobacco [88] and Arabidopsis [89,90].
Target genes for map-based cloning in rice are
disease-resistance and insect-resistance genes, biotic
and abiotic stress-tolerance genes, photoreactive
genes and other genes of biological and/or biochemical
importance (Sections 34.4.5 and 34.4.6).
The strategy for map-based gene cloning in rice
would be almost the same as that used in tomato and
Arabidopsis. How easily the target gene can be located
on the physical map depends largely on how accurately
and how closely the DNA markers tag the gene.

In rice, the total length of the genetic map is about
1600 cM and that of the physical map is 430 Mb. This
means that 1 cM corresponds to about 270 kb. A
rough estimation of the number of expressed
sequences in the rice genome tells us that 1 cM (that
is, 270 kb) contains 30-60 genes on average,
although naturally the gene density varies from
region to region. First, one should pick a long
DNA fragment cloned in a YAC, BAC or cosmid that
carries the DNA marker sequence(s) closely linked
to the target gene. Screening and identification of the
target gene among the many expressed sequences
on such DNA clones seems the most critical step. If
one can use a plant population large enough for
segregation analysis, it is possible to select, from the
several tens of genes on one YAC, several candidate
clones that show no recombination with the target
gene.
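
The conversions used in this argument can be reproduced with a few lines of arithmetic. The sketch below takes the rounded figures from this chapter (430 Mb, about 1600 cM, 30-60 genes per cM, and the 1383 markers of the first high-density map); the function names are illustrative, and the assumption of a uniform ratio between genetic and physical distance is a simplification, since this ratio varies along the chromosomes.

```python
def kb_per_cm(genome_mb, map_length_cm):
    """Average physical distance (kb) corresponding to 1 cM."""
    return genome_mb * 1000.0 / map_length_cm

def mean_marker_spacing_kb(genome_mb, n_markers):
    """Average physical interval (kb) between markers, assuming even spacing."""
    return genome_mb * 1000.0 / n_markers

def kb_per_gene_range(kb_per_cm_value, genes_per_cm=(30, 60)):
    """Range of average kb per gene implied by a gene density per cM."""
    low_density, high_density = genes_per_cm
    return kb_per_cm_value / high_density, kb_per_cm_value / low_density

ratio = kb_per_cm(430, 1600)
print(f"{ratio:.0f} kb per cM")
print(f"{mean_marker_spacing_kb(430, 1383):.0f} kb between markers on average")
print("kb per gene: {:.1f} to {:.1f}".format(*kb_per_gene_range(ratio)))
```

These figures come to about 269 kb per cM, one marker per 311 kb on average and roughly 4.5 to 9 kb per gene. The marker spacing agrees with the 300-kb interval quoted for the genetic map and is slightly smaller than the 350-kb average YAC insert, which is what makes marker-anchored YAC ordering plausible.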

In the case of Xa-1, one of the bacterial leaf blight 
resistance genes in rice, we have been able to select 
as candidates about half of the expressed sequences 
from over 20 cDNA clones mapped on one 320-kb 
YAC using about 1000 F3 lines by bulking methods 
(see Section 34.4). This YAC clone was made from 
‘Nipponbare’ DNA, and could be screened with 
overlapping cosmid clones made from the resistant 
near-isogenic lines [53]. Several cosmid clones 
covering all five to six candidate cDNA sequences 
could be selected from the cosmid library of the 
bacterial blight-resistant rice strain. The next step 
should include sequencing of these cDNA and 
cosmid clones to see sequence differences between 
resistant and susceptible strains and transformation 
of the susceptible strain by these cosmid clones to 
see whether any clone restores resistance to the 
susceptible rice strain. 

In a similar way, other target genes are being 
tagged and then isolated as large DNA fragments in 
YACs. Although identification and characterization 
of target genes will take much more time, the 
isolated genes should be very useful for plant 
improvement through gene transfer into superior 
strains. Functional analysis of the isolated genes will 
also make it possible to compile biological databases 
for metabolic pathways.


34.5.5 Future directions for physical mapping


Completion of the physical map will need not only
several kinds of DNA libraries but also methods to
fill the gaps and to find chromosomal rearrangements
and to reveal the degree and sites of
structural complexity in the genome. The resulting
physical map will be the main source for different
types of analysis of rice genome structure. Many
aspects of genome organization should be clarified
by, for example, dissecting the whole genome into
functional and nonfunctional segments, surveying
for common functional domains and their functions
both at the level of gene expression and chromatin/
chromosome structure, and also by whole genome
sequencing. The rice physical map should be the
most valuable resource in monocot plants, especially
in grass species, for further comparative
analyses.

The physical map, the cytological map, the RFLP
linkage map, the genetic linkage map of phenotypic
traits and the expression map should be combined to
generate a comprehensive genome map. Aiming to
build up such a genome map, we expect to include
the mapping of the over 20000 cDNA clones which
we will be isolating in our large-scale cDNA analysis
(see Section 34.2). It may be difficult to locate all the
cDNA clones on the linkage map, especially in the
case of multiple-copy sequences such as those of
isozymes and protein families. The difficulty comes
from sequence similarities in the cDNA clones, lack
of polymorphism and the limit of resolution in
linkage analysis. The fine mapping of almost all
genes, however, should be possible using ordered
YAC libraries. In practice, we can locate almost all
multicopy and gene family sequences on multiple
YAC clones in our system for YAC contig formation.
In the future therefore we wish to use all cDNA
clones which have not yet been mapped on the
linkage map. Such a full expression map will
contribute greatly to unravelling genome organization
for gene expression and gene evolution in rice
and other plants.


34.6 Rice genome informatics 


There are around 50 staff at the RGP. The 11 ABI 
sequencers can produce 100000 bp of data daily, and 
every week hundreds of RFLP samples are pre- 
pared. The large amounts of data need to be moved 
around easily, and therefore a fast local network is 
important. Because the RGP was expanded rapidly 
to its current size, Macintosh computers were 
chosen for general use, as they are easy to learn to 
use. This section gives technical information on how 
the data processing and analysis have been arranged 
in our project. A simplifying factor for us is that we 
have the cDNA, PCR, RFLP and physical mapping 
laboratories in the same building, facilitating the 
planning and implementation of database activities 
from the beginning. This has made it possible to 
integrate the mapping and sequencing data for the 
rice genome analysis [91]. 


34.6.1 Data devices and 
raw data inputting and editing 


We have a Hewlett Packard 9000/8975 as the main 
computer (with 128Mbyte main memory and a 
23Gbyte hard disk), several SUN computers and 
some 40 Macintoshes (mainly Macintosh Quadras 
for desktop analysis, Powerbooks as laboratory 
notebooks) in our system. They are linked with a 
10 Mbyte s⁻¹ local area network (LAN) with a star 
topology. The main host computer, HP9000, is linked 
to the LAN switching hub (LANplex 5012, 
Synernetics) at the speed of 100 Mbyte s⁻¹. 
Macintoshes communicate using AppleTalk 
protocol to the hub at a speed of 10 Mbyte s⁻¹. 
Macintosh servers are easy to set up, and we have 
one fileserver on a Macintosh with a 4.5 Gbyte total 
memory for the database files and disk space 
allocated to researchers. Macintoshes communicate 
with the SUN computers, with the HP9000 and the 
Internet by EtherTalk, using the TCP/IP protocol. 


On the SUN computers we have set up our WWW 
server (address: http://www.staff.or.jp) and our ftp 
server (ftp.staff.or.jp), and use a POP server on the 
SUN computer to deliver electronic mail to the 
Macintoshes, which are equipped with Eudora mail 
reading software. All Macintoshes have been given 
an IP number, so that they can access the World Wide 
Web by Mosaic or Netscape client software. The 
computer network system is shown in Fig. 34.10. 

We have found the Microsoft Excel spreadsheet 
easy to use for almost all data input. After initial 
input and editing, the data files are imported to a 
relational database management system (RDBMS) 
called 4th DIMENSION running on Macintoshes. 
Often subsets of data are exported in text file format 
from 4th DIMENSION and imported again to Excel for 
manipulation, and graphing and statistics programs 
are used via the Excel spreadsheet. The sequence data 
from the ABI 373 sequencers is likewise imported 
as text files from the network-linked Macintoshes 
controlling the sequencers. 
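
The text-file hand-off between spreadsheet and database described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the RGP: a tab-delimited export is assumed, and the column names (clone_id, chromosome, position_cM) and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-delimited export; the column names and values are
# invented for illustration and are not taken from RiceBase.
EXPORT = (
    "clone_id\tchromosome\tposition_cM\n"
    "C0001\t1\t12.5\n"
    "C0002\t1\t47.0\n"
    "C0003\t3\t8.2\n"
)

def load_markers(text):
    """Parse tab-delimited text into a list of dicts with typed fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "clone_id": row["clone_id"],
            "chromosome": int(row["chromosome"]),
            "position_cM": float(row["position_cM"]),
        }
        for row in reader
    ]

def markers_per_chromosome(markers):
    """Count how many markers fall on each chromosome."""
    return Counter(m["chromosome"] for m in markers)

markers = load_markers(EXPORT)
counts = markers_per_chromosome(markers)
```

Converting the fields to numbers at import time is a design choice made here so that later sorting and statistics do not operate on text by accident.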


34.6.2 Database management: 4th DIMENSION 
interface and SyBase processing 


The 4th DIMENSION is a well-designed RDBMS 
that allows the designer to change the database 
structure easily and enables users to define desired 
display formats and to print and export files for 
many different purposes. 

With the ease of use of a Macintosh, starting 
database activities is easy, and the database struc- 
ture can be continuously expanded in a flexible way. 
Recently, server and client software has become 
available, so that Macintoshes on the network can 
share the common data without the need to have all 
the data in their own machine. In addition, local 
smaller subsets of data can be prepared for daily use 
by individual researchers with the same interface. 

The current in-house database in the RGP, 
RiceBase2, combines the data from the cDNA, PCR, 
RFLP and physical mapping groups. In developing 
RiceBase2, special attention was given to designing 
a good way of storing experimental information, 
especially sets of consecutive experiments and 
protocols [92]. 

The limitation of 4th DIMENSION is in the 
memory space and processing speed of the local 
Macintoshes (though PowerMacs have performed 
quite well recently). As our database keeps growing, 
we are moving it to the HP9000 main computer into 
a SyBase RDBMS. We use a client-server database 
system, using 4th DIMENSION on Macintosh as a 
client and SyBase on HP9000 as a server. This way 
we keep the user-friendly 4th DIMENSION interface 
on the Macintoshes, which send the queries to 
the server, where they are translated into 
SyBase-format SQL queries. This in-house database is called 
RiceBase3. It already needs 1 Gbyte memory for the 
SyBase tabulations of all data, with some 12 Mb of 
edited DNA sequence, information on some 30000 
clones and 320000 individual plant genotypes used 
in linkage mapping, and thousands of gel images 
(Table 34.4). Many of the RFLP gel pictures for the 
linkage map [2] have already been put on our WWW 
server at http://www.staff.or.jp. 


Fig. 34.10 Computer hardware and network system in the RGP. 
(Diagram not reproduced. Labels recoverable from the diagram: main host 
computer; Sun4 e-mail server with Inherit system and customized in-house 
database search; X-terminal; Macintoshes, including a fileserver and a 
machine for automated similarity search; switching hub, 10 Mbps for every 
point-to-point link; router; TISN/Internet; WAN of MAFF; link to NIAR 
researchers. Locations shown: computer room, laboratory, office.) 
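
The kind of SQL query that a client sends to the database server can be illustrated as follows. This is a sketch under stated assumptions: Python's built-in sqlite3 module stands in for SyBase, and the table names (clone, yac_hit), column names and rows are hypothetical and do not reproduce the RiceBase3 schema.

```python
import sqlite3

# In-memory database used as a stand-in for the server-side RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clone (clone_id TEXT PRIMARY KEY, library TEXT);
    CREATE TABLE yac_hit (
        clone_id TEXT REFERENCES clone(clone_id),
        yac_id   TEXT
    );
    INSERT INTO clone VALUES ('C0001', 'callus'), ('C0002', 'root');
    INSERT INTO yac_hit VALUES
        ('C0001', 'Y100'), ('C0001', 'Y205'), ('C0002', 'Y100');
""")

def yacs_for_clone(connection, clone_id):
    """Return the YACs on which a cDNA clone has been located."""
    rows = connection.execute(
        "SELECT yac_id FROM yac_hit WHERE clone_id = ? ORDER BY yac_id",
        (clone_id,),
    )
    return [row[0] for row in rows]

hits = yacs_for_clone(conn, "C0001")
```

A multicopy sequence simply appears in several rows of the hit table, which is how one clone can be recorded against more than one YAC.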


Table 34.4 Partial listing of information in the in-house 
database RiceBase3 of the Japanese Rice Genome Research 
Program, as at January 1995. 

Item                                                      Amount 
Edited DNA sequences, nucleotides                         12 000 000 
Number of cDNAs sequenced                                 20 968 
Clones (cDNA)                                             35 000 
Evaluated mapping genotypes (single plant individuals)    320 000 
Mapped unique loci                                        1600 
Gel images                                                5 000 
Total number of YACs                                      7 000 


34.6.3 Toward a rice genome anatomy and 
international federated genome databases 


Over 10000 cDNA sequences are now available in 
the international database of expressed sequence 
tags (dbEST, a section of GenBank at the National 
Centre for Biotechnological Information, USDA, 
USA, see Chapter 37 and Appendix V for contact 
addresses), together with information on homologies 
to other released ESTs. In the release 86.0 (15 
December 1994) rice is already the grass for which 
the most nucleotide sequence is available. All these 
sequences can now be linked to the large number of 
published Arabidopsis sequences and in fact to all 
gene products of conserved sequences. 

Already, over 1100 rice cDNAs have been success- 
fully converted to STSs and are used as standard 
landmarks in the RFLP map [2,20]. We are also 
making progress in integrating the YAC mapping data with 
the rest of the database. More modularity is needed 
in the database, and compatibility with the ACeDB 
software [93], which is the software adopted by 
many genome projects and also by the Plant 
Genome Database (PGD, at NCBI, USDA). It is pos- 
sible that we will also add on an ACeDB interface 
into SyBase, as has been done in the Integrated 
Genomic Database (IGD, see ref. 94). 

Whatever the interface to the researcher, inter- 
national cooperation is needed to harmonize the 
semantic structures in the large number of genome 
databases being built in laboratories of very differ- 
ent sizes and resources. One step in that direction is 
the standardized plant gene nomenclature [95], 
promoted by the International Society of Plant 
Molecular Biology since 1991. The standardized 
names will later also be required information in 
publications and in the international sequence 
databases (DDBJ, EMBL, GenBank). Some of the 
sequenced rice genes have already been putatively 
assigned such standardized names [96]. 

As for the rapidly accumulating plant cDNA 
sequences, already some 14000 Arabidopsis sequen- 
ces [12] and over 10000 rice sequences are available 
in the international databases. Of all the 30000 or so 
rice genes, some 30% have already been partially 
sequenced in the RGP. Most of these and other 
cDNA sequences from big projects will be available 
only through the databases and will not be pub- 
lished in detail in the scientific journals. Even so, the 
release of large numbers of plant cDNA sequences 
makes this information available for research being 
done on structurally similar proteins in other 
organisms. Conversely, many released plant cDNA 
sequences show homology to proteins previously 
known only in organisms other than plants, giving 
valuable information on the function and evolution 
of conserved proteins and metabolic pathways. 

In the future, the mere harmonization of a collec- 
tion of genome map databases will not be enough, 
since even related organisms have a 
large amount of noncoding (but not entirely nonfunctional!) 
sequence in their genomes. Relational 
information indicating the relative positions of gene 
transcription units and regulatory elements is 
essential to build up genome anatomies [97] that can 
be compared effectively between organisms. The 
role of rice genome anatomy in such comparisons 
will be very important for all plant researchers and 
especially for cereal genome researchers. 


34.7 Rice as a model for 
cereal genome research 


As high-density RFLP maps of several cereal crops, 
such as rice, wheat and maize are constructed with 
common DNA markers, the comparison of loci for 
the RFLP markers on these maps becomes possible. 
Already, extensive colinearity of single-copy DNA 
markers has been established between rice and 
wheat [4] and between rice and maize [43]. 
Comparative RFLP maps of the homoeologous 
group 2 chromosomes of wheat, rye and barley have 
also been constructed [75]. In the case of rice and 
wheat, clear colinearity is found within almost all 
chromosomes, in spite of the large differences in 
genome sizes and numbers of chromosomes (see 
Chapter 32). 
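
One simple way to put a number on colinearity is to take the markers shared by two maps in their order on the first map, look up their positions on the second, and ask how many of them can be kept in a consistent order. The sketch below does this with the longest increasing subsequence. It is an illustrative measure chosen here, not the method used in refs 4 or 43, and the marker names and positions are hypothetical.

```python
from bisect import bisect_left

def longest_increasing_run(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[i] = smallest tail of an increasing run of length i + 1
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

def colinear_fraction(map_a, map_b):
    """Fraction of shared markers lying in one consistent order.

    Both maps are dicts of marker name -> position on one chromosome.
    The second map is also tried in reverse, since the orientation of
    a chromosome on a map is arbitrary.
    """
    shared = [m for m in sorted(map_a, key=map_a.get) if m in map_b]
    if not shared:
        return 0.0
    positions_b = [map_b[m] for m in shared]
    forward = longest_increasing_run(positions_b)
    reverse = longest_increasing_run([-p for p in positions_b])
    return max(forward, reverse) / len(shared)

# Hypothetical positions (cM) for four shared markers; m3 and m4 are
# swapped on the second map, and m5 is not shared.
rice = {"m1": 5.0, "m2": 20.0, "m3": 41.0, "m4": 60.0}
wheat = {"m1": 2.0, "m2": 15.0, "m4": 30.0, "m3": 55.0, "m5": 70.0}
fraction = colinear_fraction(rice, wheat)
```

With these invented values three of the four shared markers keep their order, so the fraction is 0.75; a fully inverted segment would still score 1.0.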

The synteny between rice and wheat, based on 
our rice linkage map, is shown in Plate 11. Around 50 
markers from wheat were used to detect poly- 
morphism in a rice F2 population and their loci were 
determined on the rice RFLP map (see Section 34.4). 
About 50 markers from rice were also mapped on 
the wheat linkage map. Almost all the loci were 
identified as single-copy. This means that during 
evolution no duplications of these single-copy 
sequences have occurred in wheat or rice. 

Clear synteny between rice and maize has also 
been recognized. In comparison to wheat, the chro- 
mosome structure of maize is rather complicated. It 
has duplicated chromosomes and the regions homo- 
eologous to rice appear twice in the maize genome. 
Using rice as an anchor species, homoeology 
between wheat and maize could be elucidated 
easily. As for other cereal genomes, such as barley 
and millet, studies on synteny with the rice genome 
are currently in progress. The main advantages in 
using rice as an anchor species for elucidating 
synteny between various grasses are as follows. 

1 The homologous alignment of nucleotide 
sequences also means the homologous alignment of 
genes corresponding to phenotypic traits. If some 
phenotypic trait is once tagged by DNA markers in 
one of the cereal crops linked by a synteny map, this 
trait is expected to be present at the corresponding 
genome location in other cereal crops. For example, 
the dwarf gene is located in wheat chromosome 
group 4 and maize chromosomes 1 and 5, and the 
respective chromosome segments have colinearity 
to each other. The corresponding segment in rice is 
located in chromosome 3, but unfortunately until 
now no such expressed gene has been located to this 
rice chromosome in the classical map. It may be that 
translocation of a short segment or a nucleotide 
replacement has suppressed the expression of the 
corresponding gene, or the gene has jumped to 
elsewhere in the genome. 

2 The isolation of genes responsible for phenotypic 
traits in cereals other than rice is feasible by 
screening rice genomic libraries (such as YAC or 
cosmid libraries) with DNA probes linked with 
phenotypic traits in the target species. For species 
with large genomes, such as maize, wheat and 
barley, construction of YAC or cosmid libraries is not 
easy, and even if constructed, screening of target 
genes might be difficult because of the presence of a 
large amount of repetitive or noncoding sequence. If 
the targeted trait shows physiological characteristics 
or signs of resistance against pathogens quanti- 
tatively similar to those in rice, the comparison 
could be done directly for gene isolation. Even if that 
trait is not expressed in rice, the conservation of an 
ancestral sequence within the corresponding region 
is expected. This approach is currently being used in 
an effort to isolate and characterize the Rpg1 gene 
(resistance gene to stem rust) in barley using rice 
YAC clones [98]. 

3 Information about synteny among the grasses is a 
powerful tool for studying evolution not only within 
grasses, but also between monocots and dicots. Rice 
has the smallest genome among the grasses and its 
chromosome structure is truly diploid. The high- 
density rice RFLP map has already been constructed 
as mentioned in Section 34.4. The nucleotide 
sequence analysis of randomly selected cDNA 
clones from rice callus library has enabled discovery 
of similar sequences in many other plant species and 
even many other eukaryotes [6]. Very similar 
sequences are found in cDNAs from other grasses, 
such as wheat, maize and barley. This suggests 
an extensive conservation, even in nucleotide 
sequence, among the grasses. For these reasons rice 
has been chosen as an anchor species for synteny 
mapping and will be a pivotal cereal crop for 
comparative genome research. 


Section 6 Internet resources 


Introduction 


Stephen P. Bryant 


Gemini Research Ltd, 162 Science Park, Milton Road, Cambridge CB4 4GH, UK 


The anarchic Internet or World Wide Web (WWW), 
progenitor of the cyberpunk novels of the early 
1980s, has achieved, after a period of relatively slow 
growth, a dominant position in the dissemination of 
information around the globe. As a medium, the 
WWW is likely to exceed the printed page in 
importance in the not-too-distant future, particu- 
larly for information that needs to be timely rather 
than long-lasting. The first electronic peer-reviewed 
journals, such as GENE/COMBIS, have begun to 
appear, and the construction of a Web site is a 
prerequisite of any collaborative project which 
requires a timely, low-cost delivery of results to the 
community or the general public. 

As Martin Bishop points out in the opening to his 
chapter (Chapter 35), it is not possible to do mole- 
cular biology in any real depth without recourse to 
information technology and the Internet. For most 
researchers, it will be most useful as the gateway to 
large collections of dynamic data, characterizing the 
human genome with increasing resolution. What- 
ever the future of the Internet, it is vital that the 
researcher understands the basic principles behind 
it and the way it can add value to the work done. 
Like the telephone and the fax machine, it is an 
integral part of the way in which we now do science. 

In an ideal world, scientists would not have to be 
engineers as well, but would be gently led through 
the database and analytical resources available on 
the Internet by interfaces that were completely 
intuitive and that did not require a degree in com- 
puter science to understand. However, reality is not 
like this at all, and researchers must be prepared 
to get their hands a little bit dirty, or else miss out. In 
Chapter 36, Jaime Prilusky has provided a guide 
that should help whenever it is necessary to dip into 
the ever-changing world of the computer operating 
system, in most cases UNIX. His is a first aid kit for 
the information age, the need for which will surely 
diminish as interfaces develop in their ease of use 
and utility. 

Finding out what is on the information superhigh- 
way is the subject of the last part of the section. Here, 
a selective directory (Chapter 37) has been produced 
of important genomic resources. If the reader spends 
some time investigating these sites, they will find 
that such a directory represents a window of code- 
pendent sites that shifts around as the priorities of 
the Human Genome Project change. It is hoped that 
the window will retain a value which will persist 
long enough to be used as a jumping-off point for the 
increasing pool of relevant information resources. 
The Internet is clearly here to stay, and these 
contributions have the modest aim of making it just 
that little bit less arcane, and just a little bit more 
accessible. 


35.1 Introduction 


Genome analysis is not possible without the aid of 
the computer as a tool. This applies both to the 
analysis of local data and to accessing information 
worldwide. Most people are now familiar with 
computers, so a detailed description of their com- 
ponents and usage will not be given here. This 
chapter will give a brief survey of the facilities 
available to access information and data for genome 
analysis. Addresses for particular sites may be 
found in the relevant chapters, in Chapter 37, and in 
Appendices III and V. 


35.1.1 Computing hardware and operating system 


The engine of the computer is the central processing 
unit (CPU), which is located with its interfaces on 
the mother board and controls the processing of 
information. Information is stored in memory, 
which may be volatile (the random access memory, 
or RAM), or the nonvolatile memory forming the 
file-store. The latter is usually a hard disk and a 
floppy disk drive using magnetic recording which 
can be read or written to, and an optical disk drive 
for reading large amounts of prerecorded infor- 
mation. Peripherals connected to the computer are 
the screen, the keyboard and the mouse. There may 
be a variety of other peripherals such as printers. 

An operating system is a master control program 
which manages the function of the computer as a 
whole and the running of the application programs. 
The operating system runs continually while the 
computer is switched on and provides the means by 
which the user directs the operations performed by 
interaction from the keyboard and mouse. Interac- 
tion has evolved in recent years from the typing of a 
single command line, through full screen interaction 
at the level of individual characters, to the more 
sophisticated graphical user interfaces (GUIs) in use 
today. 

The most commonly used operating systems 
relevant to genome analysis are: 
• Microsoft Windows 3 and NT; 
• Apple Macintosh OS; 
• Unix X-windows (from a variety of vendors). 

Windows and Macintosh GUIs must run on the 
local machine, whereas in X-windows the graphics 
can be generated on one computer and displayed 
over the network to another. Thus, in Unix it is 
possible to have an X-terminal, which is a computer 
dedicated to the GUI and user interaction. With 
suitable emulator software, a Windows (e.g. Vista 
eXceed program) or Macintosh (e.g. MacX program) 
machine can act as an X-terminal. 


Historically, Windows and Macintosh machines 
were developed for office and home environments 
while Unix was developed for scientific and techni- 
cal environments. Today, the hardware costs are 
similar for a similar configuration, irrespective of 
operating system. The important considerations are 
to have CPU and RAM giving adequate response 
times, sufficient disk space, and a screen of the 
highest quality, with a minimum diameter of 17 in 
and resolution of 1024 × 768 pixels. To interact with 
X-windows programs, a three-button mouse is 
desirable. 


35.2 User interface 


The quality of the user interface makes or breaks the 
success of a software application. A user interface 
requires some input devices to enter the necessary 
information. The common input devices are the 
keyboard and the mouse, but many others have 
been devised. Determinants of a good user interface 
are speed of learning, speed of use, elimination of 
error, rapid understanding and attractiveness to the 
user. In the genomic field, we can contrast the now 
obsolete Sybase APT forms interface to the Genome 
Data Base (GDB) with the World Wide Web (WWW) 
interface. The former involves learning arcane key- 
strokes, whereas the latter is driven by mouse clicks 
and form filling. Unfortunately, there are conflicts 
between interfaces being easy to learn and fast to 
use, and being able to exploit the full power of the 
program. There are also subjective factors: people 
differ in the sort of interface they prefer to use. 

The command line interface is not encouraging to 
novices who have to wade through reams of docu- 
mentation in order to understand the replies to 
instructions typed at a C:\ prompt. Incorrect typing 
may lead to serious errors such as deleting the 
wrong files. 

Question and answer dialog, as, for example, in the 
text version of the Staden Sequence Analysis 
Package, assumes that the reader knows the answer 
to supply. If a mistake is made earlier in the 
sequence, it is not possible to go back. 

Menus present multiple choices and act as an aide- 
mémoire to the options available. They appeal to 
novices but their attraction rapidly palls as the user 
becomes an expert. 

Form filling with default answers allows the user 
to see and alter all the relevant items of data. There is 
a danger that inappropriate values may be provided 
if the user has failed to study the nuances of the 
analysis. 

Talking to the computer in a natural language is 
not a practical proposition with today’s technology. 

Fig. 35.1 A filing system represented by icons. 

the start of the ‘WIMP’ revolution. Windows, icons, 
mice, and pull-down menus are standard in all three 
of the common operating systems listed above. 
Operations are invoked by actions performed on 
visual representations (icons) of objects. The status 
of many applications running simultaneously can 
be displayed by the appearance of their icons 
(Fig. 35.1). For example, the arrival of new electronic 
mail can be signalled. Clicking on the icon with the 
mouse opens the mail folder and mail can be read in 
a window and a reply entered. Then the application 
can be reduced back to iconified form. 


35.3 Network connectivity 


A stand-alone computer can enable plenty of work 
to be done. However, it is not very interesting to 
receive electronic mail only from oneself. Increa- 
singly, it is the interconnectivity of computers 
worldwide that is the major interest of their users. 

A computer network consists of two or more 
intelligent communicating devices (personal com- 
puters, work stations, servers) linked in order to 
exchange information and share resources. A 
communicating device on a network is called a node. 
A host is a network node which provides individual 
network users with a variety of resources such as 
processing power, file stores, applications software 
and connections to other networks. 

Computer networks now span the world, and the 
earlier classification into local area networks (LANs) 
and wide area networks (WANs) is becoming less 
useful. The only concern of the user is the slowest 
link in the connection from his or her work station to 
the host of interest. There are two main aspects of 
network operation: the physical medium which 
implements the network, and the method of data 
transfer or network protocol. 

There are many transmission media in use today, 
including copper wire, fibre optic and microwave 
link. The most common media in the workplace are: 
1 fibre optic cables made of plastics or glass which 
serve as a very high performance transmission 
medium unaffected by electrical interference; and 
2 shielded or unshielded twisted pair (UTP) copper 
wire, which is also used in telephony for the less 
demanding applications. 

There are a variety of ways in which the electronic 
signals can be placed on the LAN. Ethernet is most 
widely used; it operates on a bus configuration of a 
single strand of cable to which each node connects. 
Only one signal can travel on the cable at one time 
and the transmission speed is 10 megabits per 
second (Mbps) for standard ethernet or 100 Mbps for 
fast ethernet. For situations where heavy traffic is 
expected, direct node-to-node communication is 
possible using switched ethernet. More recent 
technologies are fibre distributed data interface 
(FDDI) using a token passing ring configuration and 


asynchronous transfer mode (ATM) using switch- 
ing. These are capable of gigabit-per-second (Gbps) 
connection speeds. 

Special cards (e.g. ethernet cards) need to be 
installed in the PC or work station to connect it to the 
LAN. Repeaters, bridges, switches and routers may 
be used to implement the LAN and its connection to 
the WAN. 

It is possible to connect to analogue (voice) tele- 
phone channels using an external or internal device 
called a modem (modulator-demodulator) which 
converts digital data to analogue form. Speeds of 
about 20 kilobits per second (Kbps) are achievable. 

Telephone networks are being converted to 
digital circuits called the Integrated Services Digital 
Network (ISDN) which offers digital services for 
both voice and data. ISDN cards permit connectivity 
at 64 Kbps. 
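
To see what these line speeds mean in practice, the ideal transfer time is simply the number of bits divided by the link rate. The following is a minimal sketch: the 1 Mbyte file size is an arbitrary example, protocol overhead and contention are ignored, and 1 Kbps is taken as 1000 bits per second.

```python
# Nominal link rates quoted in this section, in bits per second.
LINK_RATES_BPS = {
    "modem": 20_000,
    "ISDN": 64_000,
    "standard ethernet": 10_000_000,
    "fast ethernet": 100_000_000,
}

def transfer_seconds(size_bytes, bits_per_second):
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / bits_per_second

size = 1_000_000  # hypothetical 1 Mbyte file
times = {name: transfer_seconds(size, rate)
         for name, rate in LINK_RATES_BPS.items()}
```

Under these assumptions the same file needs 400 s over the modem, 125 s over ISDN and under a second over standard ethernet, which is why the slowest link in the path dominates the user's experience.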

If a node is attempting to connect by a telephone 
circuit, the host needs to have a matching modem or 
ISDN card, and the network service provider will 
dictate the method used. Telephone links are 
charged according to the time the connection remains 
open. It is also possible to install leased lines, where 
the connection remains open permanently and 
charging is according to a fixed monthly rental fee. 


35.3.1 The Internet 


The Internet is a collection of many national, 
regional, site, and individual network connections 
which all use the TCP/IP protocol suite. TCP/IP 
stands for Transmission Control Protocol/Internet 
Protocol and was originally developed for the US 
Defense Advanced Research Projects Agency (DARPA). The 
specifications of the protocol are publicly available 
so that any manufacturer can produce suitable 
equipment. This has led to TCP/IP becoming a de 
facto world networking standard. 

To connect to the Internet you need to find an 
appropriate Internet service provider. The service 
provider will then inform you as to the equipment 
needed to connect to the service. 

The service provider to the UK academic 
community is the United Kingdom Education and 
Research Networking Association, which runs a 
network called JANET (Fig. 35.2) and an IP service 
called JANET IP Service (JIPS). If you work at a 
university or Research Council institute this is likely 
to be the service available to you. 

If you work on genome analysis in a hospital or at 
home you are likely to require a commercial Internet 
service provider. There are a number of these, includ- 
ing CompuServe, Demon, Pipex and Unit, who will 
specify the hardware and software which you need. 

Fig. 35.2 Connection of the ICRF gateway (gw.icrf.ja.net) to the JANET network. SMDS means Switched 
Multimegabit Data Service and is used to implement the JANET backbone. 


35.4 Client-server computing 


The client-server model of computing is a means of 
distributing processing and graphical resources 
while sharing centralized resources such as file 
stores and data bases. Client computers issue re- 
quests and server computers respond to them over 
the network. These arrangements result in end-user 
operated facilities which can connect to remote 
systems throughout the world. The user is no longer 
communicating with a single computer; instead, ‘the 
network is the computer’. 
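
The request and response pattern described above can be shown with a toy example using Python's standard socket module. This is a sketch only: both ends run on the local machine, the request text and the reply format are invented for illustration, and a real service would handle many clients, longer messages and errors.

```python
import socket
import threading

def serve_once(server_sock):
    """Server side: accept one client, read its request, send a response."""
    conn, _addr = server_sock.accept()
    with conn:
        # A short message is assumed to arrive in a single recv() call.
        request = conn.recv(1024).decode("ascii")
        conn.sendall(("RESULT for " + request).encode("ascii"))

def run_demo(request="GET marker C0001"):
    """Start a one-shot server, send it a request as a client, return the reply."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    worker = threading.Thread(target=serve_once, args=(server_sock,))
    worker.start()

    # Client side: connect, issue the request, wait for the response.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(request.encode("ascii"))
        reply = client.recv(1024).decode("ascii")

    worker.join()
    server_sock.close()
    return reply
```

Calling run_demo() returns the server's reply to the client. The client needs to know only the server's address and the agreed message format, not where or how the data are stored.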

The components required for client-server com- 
puting are: 
• the LAN and WAN technologies described in 
Section 35.3; 
• desktop computing devices (personal computers, 
work stations, X-terminals or their emulations) 
described in Section 35.1; 
• a GUI environment based on X-windows technology 
as described in Section 35.2. If a WWW 
client (such as Netscape) is used, it may not be 
necessary to have X-windows capability (depending 
on the services the user wishes to access); 
• servers with the resources of interest. 


35.5 Database technology 


Databases are essential resources for genome 
analysis. The human genome contains perhaps 
60000 genes encoding genetic functions (dependent 
upon proteins and RNA molecules) and comprises 
some 3000 million base pairs (megabase pairs, Mbp) 
of DNA. Hundreds of differentiated cell types make 
up the human organism, and there are thousands 
of mechanisms for regulating gene expression. To 
record the basic information and to explain repro- 
duction, development, function in life, and geneti- 
cally programmed death is a major challenge in 
biological understanding and information techno- 
logy which may be called bioinformatics. 

Databases and the knowledge-based technolo- 
gies required for bioinformatics are still an active 
area of computer science research. We do not yet 
possess all the tools needed. 

A database project consists of three components: 

1 developing the database structure that will permit 
storage and maintenance of the data; 

2 entering and maintaining the data; 

3 facilitating access by providing users with suitable 
analysis and display tools. 

The three stages must not be confused (although 
they often are). The expense of development is 
minor in relation to on-going maintenance and 
support. A common experience is of moving goal 
posts: the database specification changes faster than 
the developers and maintainers can work, leading to 
projects running over budget, being late, or failing. 

Genome analysis is at present in the data 
acquisition phase. Much of the data being collected 
is of ephemeral interest. (Will, for example, the 
contents of the St Louis YAC Library be of interest in 
5 years’ time?) The available data need to be 
analysed to discover what has to be represented in 
the database to produce good results. The problem 
for genome analysis is that there are many kinds of 
data from genetic mapping, physical mapping and 
sequencing which can only be linked if common 
markers are used. It is best to analyse the database 
requirement by working back from the desired end 
result: the human genome sequence, map position of 
phenotypes, and the nature of mutations of medical 
importance. This is merely the first step in under- 
standing the organism, and further databases or 
knowledge bases will be required for development, 
the localization of gene expression, and cellular 
function. 

Once the analysis is complete, the data are 
modelled to define how they will be internally 
represented in the database. This results in the 
conceptual schema of the database and is free of any 
assumptions of hardware and software. There will 
be a single conceptual view of the data. 

In the database as implemented, the user is 
presented with an external view of the data and 
there may be many such views. Important databases 
for genome analysis are the GDB for human data 
and the Mouse Genome Database (MGD). Both are 
implemented in a relational database management 
system (RDBMS) which is a commercial product 
(Sybase). 

Users of MGD do not have access to the database 
itself. Their view of the data is provided by a WWW 
browser system. In the case of GDB, users have a 
number of options. It is possible for them to perform 
Structured Query Language (SQL) queries which 
operate directly on the database to formulate 
questions of arbitrary complexity. In addition, there 
is a graphical user interface implemented in Galaxy 
and a WWW browser interface which provide views 
that are considered to be most frequently required. 
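
To give a feel for the kind of question that SQL makes possible, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Sybase, and the tables `locus` and `marker`, their columns and their rows are all invented for illustration; they are not the GDB schema or GDB data.

```python
import sqlite3

# Hypothetical two-table schema; NOT the real GDB layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locus  (locus_id INTEGER PRIMARY KEY,
                         symbol TEXT, chromosome TEXT);
    CREATE TABLE marker (marker_id INTEGER PRIMARY KEY, name TEXT,
                         locus_id INTEGER REFERENCES locus(locus_id));
""")

# Invented rows, for illustration only.
conn.executemany("INSERT INTO locus VALUES (?, ?, ?)",
                 [(1, "LOCUS_A", "1"), (2, "LOCUS_B", "1"), (3, "LOCUS_C", "X")])
conn.executemany("INSERT INTO marker VALUES (?, ?, ?)",
                 [(1, "MARKER_1", 1), (2, "MARKER_2", 1), (3, "MARKER_3", 3)])


def markers_per_locus(connection, chromosome):
    """Return (symbol, marker count) for every locus on one chromosome."""
    query = """
        SELECT l.symbol, COUNT(m.marker_id)
        FROM locus AS l
        LEFT JOIN marker AS m ON m.locus_id = l.locus_id
        WHERE l.chromosome = ?
        GROUP BY l.locus_id, l.symbol
        ORDER BY l.symbol
    """
    return connection.execute(query, (chromosome,)).fetchall()


print(markers_per_locus(conn, "1"))
```

The point of the sketch is that the question (how many markers does each locus on a given chromosome have?) was not anticipated when the tables were designed; it is composed at query time by joining and grouping.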

The relational database model is well defined 
mathematically, with proven characteristics. How- 
ever, it is slow and tedious to implement in practice. 
One approach to this difficulty is to build software 
tools to speed the implementation of the database 
and the development of the user interface. Such 
tools are becoming commercially available. 

Another approach is to move away from the rela- 
tional model towards an object-oriented model that 
considers the data as objects and classes that 
are more natural to the human perception of the 
problem. No commercial object-oriented database 
management system is at present in popular use in 
genome analysis. However, a C language program 
written for the Caenorhabditis elegans Genome Project 
and called ACeDB operates along object-oriented 
lines. Acceptance of ACeDB by the genome analysis 
community relates to the appropriateness of the 
GUI rather than the robustness of the underlying 
storage and query method. The general user will be 
able to obtain all the information required by 
accessing the Web browser GUI of a genome dataset. 
Indeed, the power of such methods will increase 
with the introduction of executable code (applets, that 
is, small applications) running on the client, which is 
made possible by languages such as Java. 
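
The object-oriented view of data described above can be sketched in a few lines of Python. The classes `Clone` and `Locus` and their fields are hypothetical and chosen only for illustration; they are not the ACeDB data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clone:
    """A hypothetical clone object, identified by name."""
    name: str


@dataclass
class Locus:
    """A hypothetical locus that refers directly to its clone objects."""
    name: str
    map_position: float
    clones: List[Clone] = field(default_factory=list)

    def clone_names(self):
        """Follow the object references instead of joining tables."""
        return [clone.name for clone in self.clones]


locus = Locus("locus_A", 12.5)
locus.clones.append(Clone("clone_1"))
print(locus.clone_names())
```

Here a locus holds its clones directly, which is closer to the way a biologist thinks about the data than a set of tables linked by keys.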

In my view, the accurate maintenance of the data, 
the arbitrary queries made possible by SQL, and 
database integrity (that is, 
ensuring the database is accurate, correct, valid and 
consistent) give the edge to relational systems. 
However, the general user should not need to know 
they exist. 
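
As a hedged illustration of what database integrity means in practice, the sketch below declares constraints and lets the database refuse inconsistent rows. It uses Python's built-in sqlite3 module as a stand-in for a commercial RDBMS; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE locus (
        locus_id INTEGER PRIMARY KEY,
        symbol   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE marker (
        marker_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        locus_id  INTEGER NOT NULL REFERENCES locus(locus_id)
    );
""")
conn.execute("INSERT INTO locus VALUES (1, 'LOCUS_A')")


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


# A marker on an existing locus.
print(try_insert("INSERT INTO marker VALUES (1, 'MARKER_1', 1)"))
# A marker that points at a locus which does not exist.
print(try_insert("INSERT INTO marker VALUES (2, 'MARKER_2', 99)"))
# A second locus that reuses an existing symbol.
print(try_insert("INSERT INTO locus VALUES (2, 'LOCUS_A')"))
```

The checks live in the database itself, so every program that writes to it is held to the same rules.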


From bryant@icrf.icnet.uk Wed Feb 28 13:41 GMT 1996 
Sender: spb@icrf.icnet.uk 
Organization: Imperial Cancer Research Fund 
X-Mailer: Mozilla 2.0 (X11; I; SunOS 5.3 sun4m) 
Mime-Version: 1.0 
To: Martin Bishop <mbishop@hgmp.mrc.ac.uk> 
Subject: Genome Analysis Book 
Content-Transfer-Encoding: 7bit 
X-Lines: 20 

hi Martin, 

Glad you are able to contribute to the book. 

Steve Bryant                          Tel: (+44) 171 269 3850 
Human Genetic Resources Laboratory    Fax: (+44) 171 269 3801 
Imperial Cancer Research Fund 
Blanche Lane 
South Mimms 
Herts EN6 3LD UK                      Internet: bryant@icrf.icnet.uk 


Fig. 35.3 An e-mail message. 


Table 35.1 List of bulletin boards relevant to biology. 

bionet.agroforestry 
bionet.announce 
bionet.audiology 
bionet.biology.cardiovascular 
bionet.biology.computational 
bionet.biology.grasses 
bionet.biology.n2-fixation 
bionet.biology.symbiosis 
bionet.biology.tropical 
bionet.biology.vectors 
bionet.biophysics 
bionet.celegans 

bionet.cellbiol 
bionet.cellbiol.cytonet 
bionet.cellbiol.insulin 
bionet.chlamydomonas 
bionet.diagnostics 
bionet.diagnostics.prenatal 
bionet.drosophila 
bionet.ecology.physiology 
bionet.emf-bio 

bionet.general 
bionet.genome.arabidopsis 
bionet.genome.chromosomes 
bionet.glycosci 
bionet.immunology 
bionet.info-theory 
bionet.jobs.offered 
bionet.jobs.wanted 
bionet.journals.contents 
bionet.journals.letters.biotechniques 
bionet.journals.letters.tibs 
bionet.journals.note 
bionet.metabolic-reg 
bionet.microbiology 
bionet.molbio.ageing 
bionet.molbio.bio-matrix 
bionet.molbio.embldatabank 
bionet.molbio.evolution 
bionet.molbio.gdb 
bionet.molbio.genbank 
bionet.molbio.genbank.updates 
bionet.molbio.gene-linkage 
bionet.molbio.genome-program 
bionet.molbio.hiv 
bionet.molbio.methds-reagnts 
bionet.molbio.molluscs 
bionet.molbio.proteins 
bionet.molbio.proteins.7tms_r 
bionet.molbio.proteins.fluorescent 
bionet.molbio.rapd 
bionet.molbio.recombination 
bionet.molbio.yeast 
bionet.molec-model 


bionet.molecules.peptides 
bionet.molecules.repertoires 
bionet.mycology 
bionet.neuroscience 
bionet.neuroscience.amyloid 
bionet.organisms.pseudomonas 
bionet.organisms.schistosoma 
bionet.organisms.urodeles 
bionet.organisms.zebrafish 
bionet.parasitology 
bionet.photosynthesis 
bionet.plants 
bionet.plants.education 
bionet.population-bio 
bionet.prof-society.afer 
bionet.prof-society.ascb 
bionet.prof-society.biophysics 
bionet.prof-society.cfbs 
bionet.prof-society.csm 
bionet.prof-society.faseb 
bionet.prof-society.navbo 
bionet.protista 
bionet.sci-resources 
bionet.software 
bionet.software.acedb 
bionet.software.gcg 
bionet.software.sources 
bionet.software.srs 
bionet.software.staden 
bionet.software.www 
bionet.software.x-plor 
bionet.structural-nmr 
bionet.toxicology 
bionet.users.addresses 
bionet.virology 
bionet.women-in-bio 
bionet.xtallography 
sci.med 

sci.med.aids 
sci.med.dentistry 
sci.med.diseases.cancer 
sci.med.immunology 
sci.med.informatics 
sci.med.nursing 
sci.med.nutrition 
sci.med.occupational 
sci.med.pathology 
sci.med.pharmacy 
sci.med.physics 
sci.med.psychobiology 
sci.med.radiology 
sci.med.telemedicine 
sci.med.transcription 
sci.med.vision 


35.6 Network services 


Having described the context in which they operate, 
we now describe the services which are available on 
the Internet. These are evolving rapidly and one of 
the most popular, the World Wide Web, is only a few 
years old. 


35.6.1 Electronic mail 


E-mail is used for communicating messages to other 
people worldwide. It is very convenient because the 
recipient need not be contacted prior to delivery and 
messages are usually delivered within a few 
minutes (cf. fax, which can be connected to e-mail). 
E-mail has been adapted to other purposes such 
as the transfer of small files or delivery of the out- 
put of programs which take a while to run (e.g. 
FASTA). 

A variety of e-mail programs are in common use, 
and they vary considerably in their user interfaces. You 
will probably have a choice even on the same 
computer. It is unwise, however, to run more than one 
mail program at a time, because they will confuse each 
other! The WWW client Netscape 2.0 includes a mail 
facility. The common mail protocol on the Internet is 
called Simple Mail Transfer Protocol (SMTP). 

An e-mail message consists of an ‘envelope’ with 
address details and the message contents (Fig. 35.3). 
Important components of the envelope are the 
identity of the sender and recipient, date and subject 
matter. It may be possible to attach files to the 
message and these need not necessarily contain 
plain text. However, before sending formatted 
files—for example, word-processed text—make 
sure your recipient has the appropriate programs to 
read or convert the file. 
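
The split between address details and contents can be seen by building a message with Python's standard email package. This is only a sketch: the addresses and subject are those of Fig. 35.3, what the text calls the envelope corresponds here to the header fields, and the attached bytes and the file name `example.bin` are placeholders.

```python
from email.message import EmailMessage

msg = EmailMessage()

# Header fields: sender, recipient and subject matter.
msg["From"] = "bryant@icrf.icnet.uk"
msg["To"] = "Martin Bishop <mbishop@hgmp.mrc.ac.uk>"
msg["Subject"] = "Genome Analysis Book"

# Message contents as plain text.
msg.set_content("hi Martin,\n\nGlad you are able to contribute to the book.\n")

# An attached file need not be plain text; these bytes are a placeholder.
msg.add_attachment(b"\x00\x01\x02", maintype="application",
                   subtype="octet-stream", filename="example.bin")

for name in ("From", "To", "Subject"):
    print(f"{name}: {msg[name]}")
print("parts:", [part.get_content_type() for part in msg.iter_parts()])
```

Once a file is attached, the message becomes a multipart message whose first part is the text and whose second part is the attachment, which is why the recipient needs a mail program that understands such messages.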

One difficulty you may have with e-mail is 
finding the correct ‘address’ of the intended recip- 
ient. This information is harder to find than is a 
telephone number as there is no universal standard 
e-mail directory. In the first instance you may have 
to phone or fax to get the e-mail address. The form of 
the address is usually something like: 
persons_name@computer_address. The persons_name may be 
their computer user identifier, which is not necessarily 
anything meaningful. The computer_address 
may refer to a single machine or may be a domain. 
Many countries use a two-letter country code, 
for example, ‘uk’. In the UK ‘ac’ means academic 
community and ‘co’ means commercial. So to 
contact user support at the UK Medical Research 
Council’s Human Genome Mapping Project Resource 
Centre, the address is: support@hgmp.mrc.ac.uk. 
In the US the country code is omitted (like the 
country name on British postage stamps). Main 
categories include ‘edu’ for education, ‘gov’ for 
government and ‘com’ for commercial. For example, 
to obtain information about the National 
Center for Biotechnology Information at the National 
Library of Medicine belonging to the National 
Institutes of Health, the address is: info@ncbi.nlm.nih.gov. 
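
The address conventions just described can be sketched as a small Python function. It is a simplification that covers only the categories named in the text, and it assumes that a two-letter final label is a country code preceded by a category label, which is not true of every country.

```python
# Only the categories mentioned in the text.
CATEGORIES = {
    "ac": "academic (UK)",
    "co": "commercial (UK)",
    "edu": "education",
    "gov": "government",
    "com": "commercial",
}


def split_address(address):
    """Split 'user@computer_address' into (user, list of domain labels)."""
    user, sep, domain = address.strip().rpartition("@")
    if not sep or not user or not domain:
        raise ValueError(f"not a user@domain address: {address!r}")
    return user, domain.lower().split(".")


def describe(address):
    """Pick out the user, country code (if any) and category."""
    user, labels = split_address(address)
    top = labels[-1]
    if len(top) == 2 and len(labels) >= 2:
        # Two-letter final label: treat it as a country code.
        country, category = top, labels[-2]
    else:
        # No country code, as in US addresses.
        country, category = None, top
    return {"user": user, "country": country,
            "category": CATEGORIES.get(category, "unknown")}


print(describe("support@hgmp.mrc.ac.uk"))
print(describe("info@ncbi.nlm.nih.gov"))
```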


Mail lists are a form of public communication 
which enable people with common interests to 
exchange ideas and information. By subscribing to 
the list you are sent mail from every contributor to 
the list. This is useful for groups of people working 
closely together. It becomes unsatisfactory if you are 
not interested in the majority of the messages and 
you have to wade through masses of junk to find 
your urgent or important mail. The solution is the 
bulletin board. 


35.6.2 Bulletin boards or newsgroups 


Electronic bulletin boards are like the departmental 
notice board or the corner shop window. Often, 
anyone can post a message but sometimes the 
postings are moderated. There are boards for a huge 
variety of subjects but you must be careful to post to 
the correct board. It is surprising how unpleasant 
people can be when ‘flame wars’ arise on mail lists 
and bulletin boards. 

The advantage of bulletin boards over mail lists is 
that you can browse when you want and may be 
able to tell from the subject line what to avoid 
reading. There are usually a number of different 
conversations (‘threads’) going on simultaneously. 
A threaded bulletin board reading program helps 
you to read about a single topic. Because of the 
mechanism by which bulletin board information is 
propagated around the world, the postings are quite 
likely to be in the wrong order. 


Subject: Whitehead STS Mirror Site 
Date: 27 Feb 1996 06:19:54 -0800 
From: pwoollar@hgmp.mrc.ac.uk (Peter M. Woollard x4523) 
Reply-To: pwoollar@hgmp.mrc.ac.uk 
Organization: UK MRC Human Genome Mapping Project 
Newsgroups: bionet.announce 

We are pleased to announce a European mirror to 

The Whitehead Institute/MIT Center for Genome Research 
Human Genomic Mapping Project: 

"An STS-Based Map of the Human Genome" 

Originating site: http://www-genome.wi.mit.edu/cgi-bin/contig/phys_map 
Mirror site: http://www.hgmp.mrc.ac.uk/cgi-bin/contig/phys_map 

Many thanks to Lincoln Stein for his help and patience in providing 
this. 

The mirror still requires some work, but we hope that you 
find this useful in the mean time. 

The WWW pages have an explanation of the data. 

Best Regards, 
Peter Woollard 
Computing Services Section,          Internet: p.woollard@hgmp.mrc.ac.uk 
MRC Human Genome Mapping Project     http://www.hgmp.mrc.ac.uk/ 
Resource Centre, Hinxton Hall, 
Hinxton, Cambridge, CB10 1RQ, UK     Tel: ++44 (0)1223 494 523 


Fig. 35.4 A bulletin board entry. 


mbishop@hydrogen% ftp ftp.hgmp.mrc.ac.uk 
Connected to osmium. 
220 osmium FTP server (Version wu-2.4(2) Mon Apr 10 15:05:49 BST 1995) ready. 
Name (ftp.hgmp.mrc.ac.uk:mbishop): anonymous 
331 Guest login ok, send your complete e-mail address as password. 
Password: 
230- 
230- Welcome to the UK HGMP Resource Centre anonymous ftp service 
230- 
230- Please contact support@hgmp.mrc.ac.uk regarding 
230- any problems with this service 
230- 
230-Please read the file README 
230- it was last modified on Tue Jul 5 13:41:56 1994 - 609 days ago 
230 Guest login ok, access restrictions apply. 
ftp> cd manuals/handbook 
250-Please read the file README 
250- it was last modified on Fri Apr 28 10:21:41 1995 - 312 days ago 
250 CWD command successful. 
ftp> dir 
200 PORT command successful. 
150 Opening ASCII mode data connection for /bin/ls. 
total 4122 
drwxrwxr-x  other     512 Sep 28 09:04 . 
dr-xr-xr-x  other     512 Nov 23 16:28 .. 
-rw-r--r--  other     306 Apr 28 1995 README 
-rw-r-----  other  169001 Apr 28 1995 man.asc 
-rw-r--r--  other  558008 May 12 1995 man.ps 
-rw-r--r--  other  177322 May 12 1995 man.ps.Z 
-rw-r--r--  other  737792 May 12 1995 man.word 
-rw-r--r--  other  393843 May 12 1995 man.word.Z 
drwxr-xr-x  other    1024 Sep 23 15:50 manold 
lrwxrwxrwx  other       6 Sep 28 09:04 manual -> manual 
226 Transfer complete. 
642 bytes received in 0.2 seconds (3.2 Kbytes/s) 
ftp> get man.word.Z 
200 PORT command successful. 
150 Opening ASCII mode data connection for man.word.Z (393843 bytes). 
226 Transfer complete. 
local: man.word.Z remote: man.word.Z 
395611 bytes received in 2.1 seconds (1.9e+02 Kbytes/s) 
ftp> quit 
221 Goodbye. 
mbishop@hydrogen% 


Fig. 35.5 An anonymous ftp session. 

On the Internet, bulletin boards are called 
‘Network News’ or ‘Usenet’ and individual boards 
are known as ‘newsgroups’. The protocol by which 
they are propagated is called Network News 
Transfer Protocol (NNTP). The ratio of pearls of wisdom to 
dross should be monitored on a regular basis 
(otherwise you can waste a lot of time). The bulletin 
boards relevant to genome analysis are given in 
Table 35.1. 

The structure of a bulletin is rather like a mail 
message, with an envelope and contents (Fig. 35.4). 
Network News contains little of lasting value, and 
articles are rapidly expired at most sites where you 
can read the bulletins, because of the large amounts 
of disk storage required to hold them. So if you find 
Network News useful, you should read it on a 
regular basis. 

There is a wealth of news-reading programs (as 
with mail) and I have mentioned the threaded 
variety. Choose the most convenient reader avail- 
able on your system. Netscape has the ability to read 
news. 


35.6.3 File transfer 


File transfer is a way of transferring larger files, 
especially program files and large data files, from 
one computer to another. On the Internet, the file 
transfer protocol is called ftp. It includes a 
mechanism for identifying the user and giving a 
password, but the most useful form of ftp is 
anonymous ftp, by which you can access public files. 
These hold an enormous range of information, the 
most useful for our purposes being genome data 
and computer programs for genome analysis. To 
find out what is available, users can access a database 
called Archie. To use Archie, telnet (see Section 
35.6.6) to archie.doc.ic.ac.uk and when connected 
give login name ‘archie’ and command ‘help’. 
Archie clients are also available from the many other 
archie servers and other software sites. 


HGMP Gopher Information Service 

About This Gopher 
HGMP Application and Materials forms 
HGMP Training and Seminars 
HGMP Databases 
Miscellaneous information 
Other Gophers & FTP Sites 
Biological Databases 
Co-operative Human Linkage Center (CHLC) 
DOE Human Genome Program Report 
Finnish EMBnet BioBox 
Indiana University, Biology Software & Data archive 
Looking For Biologists (DOE, NIH, NSF, USDA) 
Searching the Genome Data Base (JHU’s GDB) 
Welchlab, Johns Hopkins University 


Fig. 35.6 The HGMP Gopher Information Service. 


Welcome to the Imperial Cancer Research Fund! 

The Imperial Cancer Research Fund is a charity and Europe’s largest independent cancer 
research institute - employing over 1,000 scientific and clinical staff in its own laboratories and 
units based in hospitals and universities across the United Kingdom. We carry out over 
one-third of all cancer research in this country and rely overwhelmingly on public support and 
generosity. 

Imperial Cancer Research Fund 
PO Box 123 
Lincoln’s Inn Fields 
London 
WC2A 3PX 
United Kingdom 

Telephone: 44-171 242 0200 
Fax: 44-171 405 1556 

New Item - January 1996: ICRF employment opportunities: Postdoctoral Research Fellowships 
New Item - February 1996: ICRF employment opportunities: Postdoctoral Research Fellowship 


Fig. 35.7 The ICRF WWW server. 


Fig. 35.8 The Webcrawler front page. 


Once you have located files of interest you can use 
the program ftp to download them to your system. 
An ftp session is shown in Fig. 35.5. You log in as 
‘anonymous’ and give your e-mail address as the 
password. The command dir shows you the files 
available. If directories are listed (letter ‘d’ on the left 
of the permissions string) you can go into them with 
the cd command. When you have found the file or 
files you want, specify whether you want files 
transferred in text mode (ASCII mode), for text files, 
or binary mode, for programs. The command get 
will then transfer the file of interest to your 
computer (Fig. 35.5), or use mget if you wish to 
transfer multiple files. Gopher (Section 35.6.4) or a 
Web browser (Section 35.6.5) can also be used to ftp 
files, and they have a more user-friendly interface. A 
short guide to anonymous ftp can be obtained by 
sending the message ‘help’ to ‘info@sunsite.unc.edu’. 
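
The anonymous ftp steps described in this section (log in, change directory, list, choose a transfer mode, get) can also be scripted. The following is a sketch using Python's standard ftplib module. The list of suffixes treated as text is an assumption made for illustration, and the function `fetch` is defined but not run here because it needs a live server.

```python
from ftplib import FTP

# File name endings treated as plain text; an illustrative assumption.
TEXT_SUFFIXES = (".txt", ".asc", ".ps", "README")


def transfer_mode(filename):
    """Return 'ascii' for files assumed to be text, otherwise 'binary'."""
    return "ascii" if filename.endswith(TEXT_SUFFIXES) else "binary"


def is_directory(listing_line):
    """A 'd' at the left of the permissions string marks a directory."""
    return listing_line.startswith("d")


def fetch(host, directory, filename, email):
    """Anonymous ftp: log in, change directory, list it, get one file."""
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)
        ftp.dir()  # prints the listing, like the dir command
        with open(filename, "wb") as out:
            if transfer_mode(filename) == "ascii":
                # Text mode: lines arrive without their line endings.
                ftp.retrlines(f"RETR {filename}",
                              lambda line: out.write(line.encode() + b"\n"))
            else:
                # Binary mode: bytes are copied unchanged.
                ftp.retrbinary(f"RETR {filename}", out.write)


print(transfer_mode("man.asc"), transfer_mode("man.word.Z"))
```

A call such as `fetch("ftp.hgmp.mrc.ac.uk", "manuals/handbook", "man.word.Z", "you@example.org")` would mirror the session in Fig. 35.5, assuming the server is still reachable; the e-mail address is a placeholder. Compressed files such as those ending in .Z are treated as binary, since text mode may alter their bytes.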


35.6.4 Gopher 


There is a huge amount of information on the 
Internet, much of which can be accessed by ftp. 
However, there are now more sophisticated 
methods of locating the information you require. 
Gopher was developed at the University of 
Minnesota to help find answers to questions on 
campus. It has been taken up worldwide to deliver 
documents to the user from a multitude of servers at 
centres which cooperate in providing this service. 
The client-server model is used and the gopher 
program presents a hierarchy of documents. The 
information around the world appears to the user as 
a single resource with a simple mechanism of 
browsing through it (Fig.35.6). Key word searches 
can be made, as servers have full-text indexes for 
sets of gopher documents. This is often implemented 
using the Wide Area Information Server (WAIS). 
Gopher clients are available for all three of our 
named operating systems. Gopher clients for most 
types of computers are available from the University 
of Minnesota by connecting to boombox.micro.umn.edu. 
Worldwide Gopher sites can be searched using 
the Gopher tool Veronica. Gopher is a very valuable 
source of information, but has been replaced by the 
WWW. 


35.6.5 World Wide Web 


The WWW is an easily accessible set of hypertext 
images and information available around the world. 
There is a huge and growing amount of biological 
information, databases and analysis tools which 
may be accessed by WWW (see Chapter 37 and 
Appendix V for useful addresses). Normally, WWW 
is available to anyone, but it is also possible to have 
services with access limited by username and 
password and to charge for such services. 

WWW is a client-server system. You run the 
client, called a browser, on your desktop computer. 
You open a connection to a server by specifying its 
uniform resource locator (URL). For ICRF this is 
http://www.icnet.uk/. This will take you to the 
front page or ‘home page’ as shown in Fig. 35.7. A 
variety of browsers are available, such as Mosaic, 
Netscape, Lynx (character-based, no graphics), 
tkWWW and HotJava. Mosaic software, developed 
at the University of Illinois, is available by anonymous 
ftp from ftp.ncsa.uiuc.edu or by Gopher from 
gopher.ncsa.uiuc.edu. The widely used Netscape 
software can be obtained from ftp.mcom.com. When 
you access a server, a connection from the client is 
established, a page of information is delivered, and 
then the connection is closed. 
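
A URL packs several pieces of information into one string: the protocol, the server to connect to, and the document wanted from that server. As a small sketch, Python's standard urllib.parse module can take a URL apart; the function name `describe_url` and the keys it returns are chosen here for illustration.

```python
from urllib.parse import urlparse


def describe_url(url):
    """Split a URL into protocol, server name and document path."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "server": parts.hostname,
        "document": parts.path or "/",
    }


print(describe_url("http://www.icnet.uk/"))
print(describe_url("http://www.hgmp.mrc.ac.uk/cgi-bin/contig/phys_map"))
```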


mbishop@hydrogen% telnet menu.hgmp.mrc.ac.uk 
Trying 193.62.192.50... 
Connected to tin. 
Escape character is '^]'. 

UNIX(r) System V Release 4.0 (tin) 

login: mbishop 
Password: 
Last login: Tue Mar 5 10:28:13 from hydrogen 
Sun Microsystems Inc. SunOS 5.4 Generic July 1994 

** If you don't get the menu, please log out and try again later ** 
** If you still have problems, contact user support on 01223 494520 ** 

If you used the command 'telnet', and your terminal is running 
X-Windows, you can start using the X-Windows versions of the 
programs at the HGMP. 

You may also need to set the display permissions on your 
machine by giving the local command 'xhost' followed by a list 
of machines at the HGMP (see the Computing Handbook for details). 

Do you wish to use X-Windows (y/n) >n 
Setting up environment... 

Starting menu... 

MOLECULAR BIOLOGY SOFTWARE FOR THE HGMP-RC 

MAIN MENU 
>>>> You Have NEW Mail. Choose Option 2 
 0) Help 
 1) Exit 

 2) Electronic Mail 
 3) BIOSCI/Network News (Biologist's Bulletin Boards) 
 4) Information Services 

 5) Analysis and Manipulation of Sequences 
 6) Sequence Database Searching 
 7) Genome Data 
 8) Linkage Analysis 
 9) Cell Lines, Clones & Probes Databases 

10) Other Molecular Data 
11) Utilities (File Transfer & Management) 
12) UNIX Operating System 
13) Miscellaneous ('How to...' etc) 
14) Queries, Suggestions and Comments to User Support 

Enter a number, option-name or ? > 


Fig. 35.9 A telnet session accessing the HGMP menu. 


[Screen display: an ACeDB window listing the classes available for searching (Sequence, Map, Locus, Allele, Clone, Antibody, Antigen, Strain, Paper, Author, Colleague, Journal, Source, Keyword, View, Method, KeySet, Model, Gene, NucleotideSeq, PeptideSeq), alongside the M. leprae physical map, contig 35, and the entry for the locus aroD (3-dehydroquinate dehydratase, EC 4.2.1.10; sequence EMBL:X59509_cds2).] 


Fig. 35.10 An ACeDB session derived from an X-windows telnet access. 


The pages are written in hypertext markup 
language (HTML) which is interpreted before being 
displayed on your screen. The pages may contain 
hot links to other pages or to different sites. By 
clicking on these links you are able to navigate the 
net. There are also searching facilities which enable 
you to look for sites of interest. Webcrawler 
(http://webcrawler.com/) is one example (Fig. 35.8). 

In addition to the delivery of pages, the Java language 
enables applets to be written; these are programs 
that run on the client. This is a relatively new 
development and we can expect to see many 
applications in the future, distributing the 
computational load to the clients and reducing the need 
for telnet sessions. 
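
To make the idea of hot links in an HTML page concrete, the sketch below reads a small page and collects the addresses its links point to, using Python's standard html.parser module. The page text is invented for illustration; the two addresses in it are ones that appear elsewhere in this chapter.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the target address of every <a href="..."> link."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A made-up page containing two hot links.
PAGE = """<html><body>
<h1>Example page</h1>
<p>See the <a href="http://www.hgmp.mrc.ac.uk/">HGMP Resource Centre</a>
or search with <a href="http://webcrawler.com/">Webcrawler</a>.</p>
</body></html>"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```

A browser does much the same when it interprets a page: the text between the tags is displayed, and the address held in each link is kept so that it can be followed when the link is selected.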


35.6.6 Telnet 


Telnet enables you to log in to a remote host from a 
terminal session and to use any of the programs 
available on the host. To log in you will need to be 
provided with a user name and password by the 
host administrator. The nature of the facilities 
available will depend on the terminal you are using. 
An example session accessing the HGMP menu is 
shown in Fig. 35.9. 

In order to use graphical programs you need an 
X-terminal or emulation. The ACeDB software 
requires X-windows, and an example display from 
MycDB (the Mycobacterium database) is shown in 
Fig. 35.10. 


Chapter 36 UNIX system survivors’ guide 


36.1 Introduction 


Programs used for genome analysis will often be run 
in a UNIX environment (see Chapter 3). This brief 
guide lists some of the basic commands you will 
need. Sources of further information are listed in 
Section 36.8. An example of a session using UNIX 
commands is given in Chapter 3. 


36.1.1 Golden rules 


There are two rules that you must remember when 
working in a UNIX environment. 

• Case matters. You must type a command or 
filename exactly as it exists for the 
computer to recognize it. ‘FILE’ is different from 
‘File’ and from ‘file’. 

• A file overwritten or deleted stays overwritten or deleted. 
There is no way back after deleting a file or saving a 
file under the name of an existing one. 


36.2 Basic UNIX commands 


The list below presents some examples of basic 
UNIX commands. To use them: 

• Type boldface text literally. Text that appears 
below in boldface is a UNIX command and should 
be typed as it appears in this text. 

• Substitute actual values for italicized argument 
names. 

• Optional arguments are in square brackets. 

• Repeatable arguments are followed by an ellipsis 
(...). 

• Text appearing within parentheses describes the 
action to be produced by the preceding command. 


cat: reads each filename in sequence and 
displays it on the standard output 

% cat file1 file2 (displays the contents of file1 
followed by those of file2 on the standard output) 

% cat file1 file2 > file3 (concatenates the 
contents of file1 and file2 and places the result 
into file3) 

cd: change working directory 

% cd (return to your login directory) 

% cd directory (change working directory to 
directory) 

cp: file copy 

% cp file1 file2 (make a copy of file1 and name it 
file2) 

% cp file1 ... dirname (copy one or more files into the 
specified directory) 

(% cp gamma.seq temp.txt creates a new file, or 
writes over an existing one, called temp.txt that 
contains exactly the same data as the gamma.seq 
file) 

du: summarize disk usage 

% du [name] (disk usage for directory or file name) 

grep: pattern searching utility 

% grep pattern filename (looks through filename for 
the character pattern pattern and displays the lines 
that match it) 

jobs: lists jobs 

% jobs (to see a list of all the jobs that are running 
in your login shell) 

kill: terminate a process 

% kill number ... (terminate process number) 

logout or exit: ends your UNIX session. These 
commands do not need extra arguments. Simply 
enter one of them on a line to close your current login 
shell 

% logout (ends your current UNIX session) 

% exit (ends your current UNIX session) 

lpr: send a file to printer 

% lpr -Pprintername filename (send file filename to 
printer printername) 

ls: list contents of directory 

% ls name ... (list contents of directory and 
information on files) 

% ls -l name ... (use long format) 

man: print on-line manual 

% man command (displays on screen the pages of 
the manual corresponding to command. Can be 
used to obtain information both on the UNIX system 
itself and on the programs from the GCG package) 

mkdir: make a directory 

% mkdir dirname ... (create directory dirname in 
your current directory) 

more: file displaying utility 

% more filename (displays the filename file on your 
terminal, one screen at a time) 

mv: move or rename files and directories 

% mv file1 file2 (change name of file1 to file2) 

% mv dirname newdirname (change name of 
dirname to newdirname) 

% mv file1 dirname (move file1 from its position 
into directory dirname) 

passwd: change login password 

% passwd (will ask you for your current password 
and for the new one) 

ps: process status 

% ps (prints information about active processes) 

pwd: print working directory 

% pwd (displays the full pathname of your current 
directory) 

rm: remove files or directories 

% rm file1 ... (delete file1) 

% rm -r dirname (deletes dirname and all its 
included files) 

tail: copies the last portion of a file to the standard 
output. If no other option is given it will show the 
last 10 lines of the file 

% tail file1 (displays the last 10 lines of file1) 

% tail -30 file1 (displays the last 30 lines of file1) 

% tail -30c file1 (displays the last 30 characters of file1) 

who: who is on the system 

% who (lists each on-line user’s login name, terminal 
and login time) 

id: displays your current user id. It does not 
require any other option or parameter 

36.3 Basic UNIX activities 


36.3.1 Your current directory and its files 


Your current directory is whatever directory 
you are currently working in. To see the name of 
your current directory, use the % pwd (print 
working directory) command. To list the files in your 
current directory, use the % ls (list) command. 


36.3.2 Entering file names 


If you enter the filename gamma.seq, UNIX assumes 
that you are referring to a file in your current 
directory. You name files in other directories by 
including the directory path with the file name. A 
typical directory path has slashes (/) that separate 
each subdirectory. (You can think of a directory as a 
file drawer and of a subdirectory as a hanging folder 
in it.) 

For example, typing: 

% more /usr/users/maps/filename.txt 

means that you want to ‘view’ the file filename.txt 
located in the directory /usr/users/maps, where 
maps is a subdirectory of users, and users is a 
subdirectory of usr. 
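
The way a directory path breaks down into nested subdirectories can be checked with a short Python sketch. It uses the standard pathlib module on the example path above; nothing is read from disk, so the path does not need to exist.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/usr/users/maps/filename.txt")

# The file name is the last component of the path.
print(path.name)

# The directory that holds the file.
print(path.parent)

# Every component, from the root directory downwards.
print(path.parts)

# Each enclosing directory, innermost first.
print([str(p) for p in path.parents])
```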

To list all the files in another directory, use the % ls 
command with a directory path. For example, typing: 
% ls /usr/users 
lists all the files in directory /usr/users. 


36.3.3 Working in other directories 


In general, you should work in your current direc- 
tory, just as you should work at your own lab bench. 
However, there may be files in other directories that 
you want to read or copy. 

You can change from your current directory to 
another directory by using the % cd command. For 
example, typing: 

% cd /usr/users/burgess 

means you want to change your current directory to 
the directory /usr/users/burgess. 

You can also use a relative path to refer to directories. 
For example, if your current directory is 
/usr/users/map and you want to change to the directory 
/usr/users/map/test_sequences, you would type 

% cd test_sequences 

to move down to that directory. 

If your current directory is 
/usr/users/map/test_sequences and you want to change back to the 
directory /usr/users/map, you would type 

% cd .. 

to move up to that directory. (Note that the ‘..’ refers 
to the directory above your current directory.) 

If your current directory is /usr/users/map and 
you want to change to the directory 
/usr/users/compare, you would type 

% cd ../compare 

to move over to that directory. 
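
The three moves described above (down, up and over) can be tried safely with a short sketch. It assumes a Bourne-compatible shell and recreates the directory names from the text under a temporary scratch directory, so the variable names and the scratch location are illustrative only.

```shell
# Build a scratch tree that mimics the layout used in the text.
root=$(mktemp -d)
mkdir -p "$root/usr/users/map/test_sequences" "$root/usr/users/compare"

# Start in the 'map' directory.
cd "$root/usr/users/map" || exit 1

cd test_sequences            # move down with a relative path
down=$(basename "$PWD")

cd ..                        # move up to the parent directory
up=$(basename "$PWD")

cd ../compare                # move over to a sibling directory
over=$(basename "$PWD")

# Show where we ended up and what the parent directory contains.
pwd
ls "$root/usr/users"
```

Recording the directory name after each step makes it easy to confirm that a relative path was resolved against the current directory each time.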


36.3.4 Controlling program execution 


With UNIX, you can run several jobs from the same 
terminal. A job is any program that you can start 
from the command line. You can run a job in either 
the foreground, which lets you control it with input 
from the keyboard as it is running, or in the 
background, which frees your terminal for other 
work. You can have many jobs in the background, 
but only one job in the foreground. 

You may want to run a job in the background — for 
example, if you were processing a large file with 
illustrations for a PostScript printer which took a 
number of minutes. There are also some GCG 
(Genetics Computer Group) programs (Wisconsin 
Package™ programs) that require time to complete, 
so it makes sense to process the programs in the 
background while you do other work at your 
terminal. 


36.3.5 Background and foreground processing 


To run a job in the background, add an & (amper- 
sand) to the end of the command line. For example, 
% sort < unsorted.txt >> sorted.txt & 

The shell assigns your background job a number 
and displays that number on your terminal screen, 
along with the process identifier (pid) of the job. For 
example, 

[1] 14999 

shows the job number as [1] and the pid as 14999. 

To see a list of all the jobs that are running in your 
login shell, use the % jobs command. On the list, 
each job is marked as either Running or Stopped. For 
example, this job is currently running: 

% jobs 
[1] Running sort < unsorted.txt >> sorted.txt 

(Note that your login shell is never displayed in the 
jobs list.) A job will stop running if it needs 
information that you must enter at the keyboard, 
such as a filename or other parameter. When this 
happens to a background job, the shell displays a 
message like the following on your screen: 
[1] + Stopped (tty input) sort 
and the job stops running. To continue with this job, 
you need to enter information at the keyboard. 
However, before you can use the keyboard, you 
must bring the job into the foreground by using the 
% fg (foreground) command. 

To do this, you would enter 
% fg %1 
where the %1 is the way to refer to job number [1]. 

After you enter the information, you can put a 
foreground job into the background again by using 
Ctrl-Z to stop the job, then entering the command 
% bg (background) to put it in the background for 
processing. Because there is only one foreground job 
at any given time, it is not meaningful, or possible, to 
specify a job number or process id when stopping a 
job with Ctrl-Z. 
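
A minimal sketch of background processing follows, assuming a Bourne-compatible shell (the $! variable and the wait command are used here in their Bourne-shell form; the C shell implied by the % prompt behaves similarly but is not identical). Because fg, bg and Ctrl-Z need an interactive terminal, the sketch uses wait to collect the job instead, and sleep stands in for a long-running program.

```shell
# Start a short, harmless job in the background; '&' returns control at once.
sleep 2 &
bg_pid=$!                    # process identifier (pid) of the latest background job
echo "started background job with pid $bg_pid"

# List the jobs known to this shell.
jobs

# Block until the background job finishes, then report its exit status.
wait "$bg_pid"
status=$?
echo "job finished with status $status"
```

At an interactive prompt you would normally carry on with other work instead of calling wait, and use % jobs to check on progress.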


36.3.6 Wildcard characters 


There are several characters that allow you to enter 
only portions of the directory or filenames. The three 
most frequently required are: 

• A leading ‘~’ expands to the home directory of a 
particular user. 

• Each ‘*’ is interpreted as a specification for zero or 
more of any character. 

• Each ‘?’ is interpreted as a specification for exactly 
one of any character. 

For example, the pattern ‘dog*’ will find matches 
for, among others, files named ‘dog’, ‘doge’, and 
‘doggy’. The pattern ‘dog?’ matches, among others, 
‘dogg’ but not ‘dog’ or ‘doggy’. 

% cp *.seq /usr/burgess/mydirectory 

copies every file in the current directory that ends with the 
characters .seq to the directory called 
/usr/burgess/mydirectory. 
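
The wildcard rules can be checked with a small experiment. This sketch assumes a Bourne-compatible shell; the file names are invented for the illustration and are created empty in a scratch directory.

```shell
# Create a scratch directory holding a few empty, hypothetical files.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch dog dogg doggy cat

# '*' matches zero or more characters, so all three dog files qualify.
star_matches=$(echo dog*)
echo "dog* expands to: $star_matches"

# '?' matches exactly one character, so only the four-letter name qualifies.
question_matches=$(echo dog?)
echo "dog? expands to: $question_matches"

# A leading '~' expands to a home directory (your own when no user is named).
echo ~
```

Using echo to preview what a pattern expands to is a cautious habit before handing the same pattern to a command such as cp or rm.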


36.4 Controlling output 


Ctrl-C ends a program or an executing UNIX 
command. 

Ctrl-Z suspends program execution. 

Ctrl-Q starts screen output that has been stopped. 

Ctrl-S stops screen output. 

Ctrl-R refreshes the command line. 

Ctrl-U deletes from the cursor to the beginning 
of a line. 

Note: Whenever you see a key combination 
written as Ctrl-C in the documentation, it means 
you press the <Ctrl> key and hold it down 
while you press the letter key, which in this case is 
<C>. 

To restart a suspended program, type 

% fg %6 

which means put job number 6 in the foreground; 
% bg %2 

means put job number 2 in the background. 

If you cannot remember what programs you have 
suspended, type 
% jobs 
to list the jobs and job numbers. 


36.5 Comparison of VMS and 
UNIX commands 


In both the VMS and UNIX versions of the 
Wisconsin Package™ for sequence analysis, you run 
a GCG program by entering information on the 
command line (after the VMS $ prompt and the 
UNIX % prompt). The command line can contain the 
name of a command, command qualifiers, qualified 
parameters (qualifiers with values), and unqualified 
parameters (usually file-names). The general syntax, 
or structure, of a VMS command is 
Command /QUALifier /QUALifier=Parameter Parameter 

As with all VMS commands, spaces on the 
command line are ignored, characters can be typed 
in upper case or lower case (case insensitivity), 
qualifiers are indicated by a ‘/’ (slash), parameters 
are indicated by an ‘=’ (equals sign), and the bold 
typeface indicates the fewest number of characters 
you enter. 

An example of a GCG command using VMS 
syntax is 

MapPlot /CIRcular /OUTfile=pBR322.MapPlot EMBL:pBR322 

A UNIX command varies in several ways from a 
VMS command. The general syntax of a UNIX 
command is: 

command -QUALifier -QUALifier=Parameter Parameter 

The main difference between the VMS and UNIX 
versions is that in UNIX command names must be typed in 
full (no shortcuts) and in lower case. In addition, 
qualifiers are indicated by a space and a ‘-’ (hyphen), 
instead of a ‘/’ (slash). In both versions, qualified 
parameters are indicated by an ‘=’ (equals sign) and 
spaces are not accepted between a qualifier and its 
parameter(s). The case of unqualified parameters 
can vary, but if the unqualified parameter is a file 
name, you must enter the file name in the exact case 
shown. An example of the mapplot GCG command 
using UNIX syntax is: 

mapplot -CIRcular -OUTfile=pbr322.mapplot EMBL:pbr322 


36.6 Control-key (^ key) differences 


Control-key combinations that you use in VMS may 
produce different results when you use them in 
UNIX. (Note that in GCG documentation control-key 
combinations are written as Ctrl-C and not ^C.) 
The following table lists the control-key 
combinations that differ between VMS and UNIX: 


VMS      UNIX     Description 

Ctrl-Y   Ctrl-C   ends a program 
(None)   Ctrl-Z   suspends a program 
Ctrl-Z   Ctrl-D   end of file 


36.7 Program name changes in 
UNIX GCG 


• Clear becomes ClearPlot 
• Echo becomes EchoKey 
• Extract becomes ExtractPeptide 
• Find becomes FindPatterns 
• Fold becomes FoldRNA 
• Shift becomes ShiftOver 
• Strings becomes StringSearch 


36.8 URLs for live help on the Web 


A Concise Guide to UNIX Books 
http://rclsgi.eng.ohio-state.edu/Unix-book-list.html 


All about Unix 
http://ugrad-www.cs.colorado.edu/unix/Home.html 


Fundamentals of Unix 
http://www.gl.umbc.edu/~banz/intro_unix.html 


Introduction to Unix 
http://musie.phlab.missouri.edu/IntroToUnix/unix-tutor/index.html 


Top 10 Unix Questions at Dartmouth 
http://coos.dartmouth.edu/~pete/top10.html 


UNIXhelp for Users 
http://www.ucs.ed.ac.uk/~unixhelp/servers.html 


Unix Documentation 
http://web.gmu.edu/bcox/Unix/00Unix.html 


Unix Documentation 
http://www.efs.mq.edu.au/unix/index.html 


Unix FAQ 
http://www.cis.ohio-state.edu/hypertext/faq/usenet/unix-faq/unix/intro/faq.html 


Unix Resources 
http://wwwhost.cc.utexas.edu/cc/services/unix/index.html 


Unix for MS-DOS users 
http://ugrad-www.cs.colorado.edu/unix/unix4dos.html 


Unix is a Four Letter Word ... 
http://tempest.ecn.purdue.edu:8001/~taylor/4ltrwrd/html/unixman.html 


Unix on the Macintosh 
http://www.astro.nwu.edu/lentz/mac/unix/ 


37.1 Subject indexes 


37.1.1 Databases by organism 


Aedes aegypti 
Mosquito Genomics WWW Server 
Aedes aegypti Genome Data Base (AaeDB) 


Anopheles gambiae 
Mosquito Genomics WWW Server 
AnoDB 


Arabidopsis thaliana 

Arabidopsis Biological Resource Center 

Arabidopsis Information Management System 
(AIMS) 

Arabidopsis Stock Centre (Nottingham) 

Arabidopsis thaliana Data Base (AtDB, formerly 
AAtDB) 

USDA Agricultural Plant Genomes 


Bacillus subtilis 

NRSub 

BSORE (see GenomeNet) 

MICADO 

National Centre for Biotechnology Information 
genomes database 


Caenorhabditis elegans 

Caenorhabditis elegans Database (ACeDB) 

Caenorhabditis elegans Genome Project 

Caenorhabditis Genetics Center (CGC) (stock 
centre) 

Canadian Genome Analysis and Technology 
Program (CGAT) 

Moulon WWW server 

Sanger Centre 

Washington University School of Medicine Genome 
Sequencing Center 

XREFdb 


Candida albicans 

University of Minnesota Medical School, Computa- 
tional Biology Center 

Virtual Genome Center 


Cereals 
USDA Agricultural Plant Genomes 


Chicken 
ChickMap 
Japan Animal Genome Databases 


Chlamydomonas 
ChlamyDB 


Comparative databases 

OMIA 

La Trobe University Comparative Genome Mapping 
Page 

Vertebrate Comparative Database 

XREFdb 


Cotton 
CottonDB 


Cow 

BovMap 

Japan Animal Genome Databases 
Meat Animal Research Center (MARC) 


Cyanobacteria 
GenomeNet, Japan 


Dog 
Dog Genome Project 
DogMap 


Drosophila 

Drosophila database, Stanford 
FlyBase 

TBASE 

University of California, Berkeley 


Escherichia coli 

Colibri 

EcoCyc 

ECO2DBASE: E. coli gene-protein database 
EcoSeq, EcoMap, EcoGene 

E. coli Genetic Stock Center 

Escherichia coli database collection (ECDC) 
GenProtEc: E. coli Gene Database 


Horse 
Japan Animal Genome Databases 
University of California, Davis 


Human (The entries listed here provide links to 
other sites; individual chromosomes are listed in 
numerical order in the main listing in Section 37.2.) 

BodyMap 

CEPH-Généthon Integrated map 

Genetic Location Data Base (LDB) 

Genome Data Base (GDB) 

Harvard Biological Laboratories 

Human Genome Project (HGP) at Oak Ridge 
National Laboratories 

Human population genetics database (Genography) 

Integrated Genomic Database (IGD) 

National Center for Genome Resources 

UK Human Genome Mapping Project Resource 
Centre 


Legumes 

AlfaGenes 
BeanGenes 
CoolGenes 


Maize 

Maize Genome DataBase (MaizeDB) 

University of Minnesota Medical School, Computa- 
tional Biology Centers 

USDA Agricultural Plant Genomes 


Microbial genomes 

DOE Microbial Genomes Initiative 

Caltech Genome Research Laboratory 

Canadian Genome Analysis and Technology Pro- 
gram (CGAT) 

Microbial Advanced Database 

TIGR 


Mitochondrial genomes 

Canadian Genome Analysis and Technology Pro- 
gram (CGAT) 

MITOMAP 

Organelle Genome Megasequencing Project (OGMP) 


Mouse 

Baylor College of Medicine Genome Center 

Caltech Genome Research Laboratory 

Dysmorphic Human—Mouse Homology Database 
(DHMHD) 

European Collaborative Interspecific Backcross 
(EUCIB) 

Gene Knockouts Database 

Mouse Genome Database (MGD) and the Encyclope- 
dia of the Mouse Genome 

Mouse Locus Catalogue (MLC) 

mousedb 

TBASE 

Whitehead Institute Mouse Genetic Map Infor- 
mation 


Mycobacterium 
Mycobacterium Database (MycDB) 
Sanger Centre 


Organelle genomes 

Canadian Genome Analysis and Technology Pro- 
gram (CGAT) 

MITOMAP 

Organelle Genome Megasequencing Project (OGMP) 




Pig 

Meat Animal Research Center (MARC) 
PigMap 

TBASE 


Plant genomes 
USDA/ARS/NAL Plant Genome Data 


Plant pathogens 
PathoGenes 


Plasmodium 
Walter and Eliza Hall Institute of Medical Research 


Rat 
RATMAP 
TBASE 


Rice 

Japanese Rice Genome Program (RGP) 

RiceGenes 

USDA Agricultural Plant Genomes 

University of Minnesota Medical School, Computa- 
tional Biology Centers 


Saccharomyces cerevisiae 
Saccharomyces Genome Project 
Sanger Centre 

Saccharomyces Genome Database 
Yeast Protein Database 


Schistosoma 
Schistosoma Genome Project 


Schizosaccharomyces pombe 

Cold Spring Harbor Laboratory 

Sanger Centre 

Schizosaccharomyces pombe (NIH fission yeast in- 
formation) 


Sheep 
Meat Animal Research Center (MARC) 


SheepBase 


Solanaceae (e.g. tomato, potato) 
SolGenes 


Trees 
TreeGenes 


Yeast 
See Saccharomyces cerevisiae and Schizosaccharomyces 
pombe 


Zebrafish 
FishNet 
Zebrafish Sequence Analysis Project 


37.1.2 Other databases and resources 


Cell lines 

American Type Culture Collection (ATCC) 

CBA-IST (Genova) 

European Collection of Animal Cell Cultures 

GDB 

National Institute of General Medical Sciences 
(NIGMS) Human Genetic Mutant Cell Repository 

Radiation Hybrid Database 


DNA sequences and genes 

CpG Island Database 

dbEST (expressed sequence tags) 
dbSTS (sequence tagged sites) 

EGAD (Expressed Gene Anatomy Database) 
EMBL Nucleotide Sequence Database 
GenBank 

Gene family database 

Mendel (plant genes) 

REPBASE (repetitive elements) 

V BASE (Ig genes) 


Proteins 

Danish Centre for Human Genome Research 
(Human 2-D PAGE databases) 

ENZYME (the Enzyme Nomenclature Database) 

Kabat Database of Proteins of Immunological 
Interest 

PIR (protein sequence database) 

Prodom (protein domain database) 

PROSITE (protein sites and patterns) 

REBASE (restriction enzymes) 

SWISS-PROT (protein sequence database) 

SWISS-2DPAGE (two-dimensional polyacrylamide 
gel electrophoresis database) 

SWISS-3DIMAGE (3D images of proteins and other 
biological macromolecules) 


37.1.3 Genome resource centres and information 


Australia 
Australian Genomic Information Centre (AGIC) 


Canada 

Canadian Genome Analysis and Technology 
Program (CGAT) 

InfoBiotech Canada 


Denmark 
Danish Centre for Human Genome Research 


Europe 
EMBL/EBI: the European Bioinformatics Institute 


France 

Centre d’Étude du Polymorphisme Humain (CEPH) 
Genestream at EERIE 

Généthon 

InfoBioGen 

Institut Pasteur 

Moulon WWW server 


Germany 
Reference Library Database (RLDB) and Reference 
Library System 


Israel 
Weizmann Institute of Bio-informatics 


Japan 
GenomeNet, Japan 


Spain 
Centro Nacional de Biotecnologia (Madrid) 


Switzerland 
CBRG at ETHZ 
Geneva University 


United Kingdom 

Sanger Centre 

Roslin Institute, Edinburgh 

UK Human Genome Mapping Project Resource 
Centre 


United States 

AGIS 

Baylor College of Medicine Genome Center 

Caltech Genome Research Laboratory 

Cooperative Human Linkage Center (CHLC) 

Harvard Biological Laboratories 

Human Genome Program 

I.M.A.G.E. Consortium 

Lawrence Berkeley Laboratory 

Lawrence Livermore National Laboratory 

Los Alamos National Laboratory 

Motif BioInformatics Server 

National Center for Biotechnology Information 
(NCBI) 

National Center for Genome Resources (NCGR) 

National Center for Human Genome Research 
(NCHGR) 

Stanford Human Genome Center 

TIGR 

University of Michigan Human Genome Center 

USDA Agricultural Plant Genomes/ USDA/ARS/ 
NAL Plant Genome Data 

Whitehead Institute / MIT Genome Center 


37.1.4 Software: sources of information and 
programs 


Computer-assisted design of oligonucleotide 
primers (e.g. OLIGO, OSP, PRIMER, PRIMEGEN) 

Whitehead Institute 

UK Human Genome Mapping Project Resource 
Centre 


Construction of integrated maps (e.g. SIGMA) 
ICRF 

UK Human Genome Mapping Project Resource 
Centre (Mapping Analysis Menu) 

National Center for Genome Resources 


DNA sequence assembly and analysis 

George M. Church Laboratory 

GCG (Wisconsin Package) 

GenomeNet, Japan 

NIH sequence analysis services 

Oak Ridge National Laboratory 

Philadelphia Genome Center 

PYTHIA (see Chapter 25) 

Sanger Centre 

UK Human Genome Mapping Project Resource 
Centre 

University of Minnesota 


Linkage analysis and linkage mapping 

Columbia University Linkage Analysis Web Server 

Cooperative Human Linkage Center (CHLC) 

UK Human Genome Mapping Project Resource 
Centre 


Physical mapping 

Sanger Centre 

Imperial Cancer Research Fund (ICRF) 

UK Human Genome Mapping Project Resource 
Centre 

Généthon (QUICKMAP) 

University of Michigan 

Whitehead Institute 


Radiation mapping 

UK Human Genome Mapping Project Resource Centre 
University of Michigan 

Stanford Human Genome Center 
Whitehead Institute 




37.2 Alphabetical list of databases 
and genome resource centres 
accessible via the World Wide Web 


Many of these sites can be accessed most con- 
veniently through national or regional genome 
resource and information centres that have exten- 
sive links to other sites, such as the UK Human 
Genome Mapping Project Resource Centre, Harvard 
Biological Laboratories, the Motif Bioinformatics 
WWW Server, GenomeNet, Japan, AGIS, etc. You 
will need a WWW browser such as Netscape. Most 
services are publicly accessible with the appro- 
priate software but some require registration or 
subscription to use fully. This list does not aim to be 
comprehensive but many other services and sites 
may be accessed through the sites listed here. 


AaeDB see Aedes aegypti Genome Database 
ACeDB see Caenorhabditis elegans DataBase 


Aedes aegypti Genome Database (AaeDB) 
http://klab.agsci.colostate.edu/acedb/AaeDB-acedb.html 

Held at the Colorado State University, this database 
aims to collate both genetic and physical chro- 
mosome mapping data for the mosquito Ae. aegypti, 
the vector of yellow fever. 


AGIC see Australian Genomic Information Centre 
AGIS see Agricultural Genome Information Server 


Agricultural Biotechnology Information Center 
http://www.nalusda.gov/bic 


Agricultural Genome Databases see Agricultural 
Genome Information Server 


Agricultural Genome Information Server (AGIS) 
http://probe.nalusda.gov:8000 

AGIS is a cooperative effort between the University 
of Maryland, Department of Plant Biology and the 
National Agricultural Library, and is sponsored by 
the US Department of Agriculture, Agricultural 
Research Service. You can browse and search a com- 
prehensive collection of genome databases relevant 
to crop plants, pests and domesticated animals. Also 
links to molecular biology and informatics servers 
such as the European Bioinformatics Institute, 
EMBL, National Center for Biotechnology Infor- 
mation, National Center for Genome Resources, etc. 
AGIS also holds genome analysis tools and 
software. The ‘how to’ page has useful advice on 
retrieving data for those new to genome research. 


AIMS see Arabidopsis Information Management 
System 


AlfaGenes 

http://probe.nalusda.gov:8000/plant/ 

Database (still in the experimental stages) for alfalfa 
(Medicago sativa). Curators: Daniel Z. Skinner (e- 
mail: dzolek@ksu.ksu.edu) and Paul C. St. Armand 
(e-mail: pst@ksu.ksu.edu). 


ANGIS (Australian National Genomic Information 
Service) see Australian Genomic Information Centre 


American Type Culture Collection (ATCC) 
http://www.atcc.org/ 


AnoDB 

http://konops.imbb.forth.gr/AnoDB 

Database for Anopheles gambiae, the vector of 
malaria. 


Arabidopsis Biological Resource Center 
Ohio State University, 1735 Neil Avenue, Columbus, 
OH 43210, USA (e-mail: Arabidopsis+@osu.edu). 


Arabidopsis Stock Centre (Nottingham) 
http://nasc.life.nott.ac.uk 


Arabidopsis Information Management System 
(AIMS) 
http://genesys.cps.msu.edu:3333/ 


Arabidopsis thaliana Database (AtDB, formerly 
AAtDB) 

http://genome-www.stanford.edu/Arabidopsis 
Located in the Department of Genetics, School of 
Medicine, Stanford University, USA. For queries 
contact atdb-curator@genome.stanford.edu 
Anonymous ftp to ftp-genome.stanford.edu 


AtDB see Arabidopsis thaliana Database 
ATCC see American Type Culture Collection 


Australian Genomic Information Centre (AGIC) 
http://angis.su.oz.au/Agic/about.html 

Objectives are to manage the Australian National 
Genomic Information Service (ANGIS) and to 
conduct collaborative bioinformatics research and 
development projects. Located in the Electrical 
Engineering Department of the University of 
Sydney. It also hosts ABNET, the Australian 
Bioinformatics Network. For further information 
contact Tim Littlejohn (e-mail: tim@angis.su.oz.au; 
tel./fax: (+61 2) 351 2948). 


Bacillus subtilis Database (NRSub) 

http://acnuc.univ-lyon1.fr/nrsub/nrsub.html 

DNA sequences from B. subtilis. Additional data on 
gene mapping and codon usage are provided by 
links to other sites (e.g. InfoBioGen, EBI, HGMP). 
This database is mirrored in Japan at GenomeNet, 
Japan. 


Baylor College of Medicine Genome Center 
http://gc.bcm.tmc.edu:8088/home.html 

Projects on human chromosomes 6, 15, 17, X and 
mouse X. 


BeanGenes 

http://probe.nalusda.gov:8000/plant/ 

Database for information on Phaseolus and Vigna 
species. Curator: Phil McClean, North Dakota State 
University, Fargo, North Dakota, USA (e-mail: 
cclean@beangenes.cws.ndsu.nodak.edu). 


BodyMap 

http://imcb.osaka-u.ac.jp/bodymap/welcome.html 

An anatomical expression database of human genes. 


BovMap 

http://locus.jouy.inra.fr/~samson/bovmap/intro.html 

Contains information on cattle loci, alleles, genetic 
and physical maps, polymorphisms, homologies, 
probes, primers and references. 


Brookhaven National Laboratory (Biology) 
http://bnistb.bio.bnl.gov:8000/ 
Information on DNA sequencing project. 


Caenorhabditis elegans Genome Project 

http://www.sanger.ac.uk 

http://genome.wustl.edu/gsc/gschmpg.html 
Information on the sequencing work carried out at 
the Sanger Centre and Washington University, St 
Louis as part of the C. elegans genome project, 
including cosmid sequences, access to ACeDB, 
genome maps, analytical and assembly software, 
etc. Links to other sites. 


Caenorhabditis elegans Database (ACeDB) Moulon 
server 
http://moulon.inra.fr:8001/acedb/acedb.html 


Caenorhabditis elegans WWW server (University of 
Texas Southwestern Medical Center at Dallas) 
http://eatworms.swmed.edu 


Caenorhabditis Genetics Center (CGC) 

http://elegans.cbs.umn.edu/ 

Based at the University of Minnesota, the Center 
maintains and distributes stocks of Caenorhabditis 
elegans mutant strains. Curator: Theresa L. Stier- 
nagel, Caenorhabditis Genetics Center, 250 Bio- 
logical Sciences Center, 1445 Gortner Avenue, 
University of Minnesota, St Paul, MN 55108-1095, 
USA (e-mail: stier@molbio.cbs.umn.edu). 


Caltech Genome Research Laboratory 

http://www.tree.caltech.edu/ 

Construction of Human Bacterial Artificial Chromo- 
some (BAC) Library resource. Physical mapping of 
human chromosome 22 using BAC clones and YAC 
frameworks. Construction of Mouse Bacterial 
Artificial Chromosome (BAC) Library resource. 
Sequencing the 1.8 megabase genome of the arche- 
bacterium Pyrobaculum aerophilum. 


Canadian Genome Analysis and Technology Pro- 
gram (CGAT) 

http://cgat.bch.umontreal.c. 

Contains information resources for genomics 
research in Canada, genomics data bases and infor- 
mation, data on chromosome X, C. elegans cosmid 
transgenics, the Organelle Genome Database 
(GOBASE) and Organelle Genome Megasequencing 
Program, the Rose Worm lab, and the Sulfolobus 
solfataricus Genome Project. 


Candida albicans 

http://alces.med.umn.edu/Candida.html 
Contains information on the genetics, physical map, 
and sequence data of C. albicans. 


CBA-IST (Genova) 

http://www.ist.unige.it/ 

Contains the Cell Line Database listing 3000 human 
and animal cell lines from European culture 
collections. 


CBRG at ETHZ, the Computational Biochemistry 
Server at ETHZ (the Swiss Federal Institute of 
Technology) 

http://cbrg.inf.ethz.ch/ 

Collection of programs for SWISS-PROT and other 
database searching and for constructing 
phylogenetic trees. 


Cedars-Sinai Research Institute Molecular Genetics 
Laboratories 
http://www.csmc.edu/genetics/korenberg/korenberg.html 

Contains the integrated YAC/BAC/PAC resource 
for the human genome, the chromosome 21 pheno- 
typic mapping project, gene mapping projects on 
chromosome 21, a BAC contig of the chromosome 21 
congenital heart disease region. 


Centro Biotecnologie Avanzate, Genova, Italy see 
CBA-IST 


Centro Nacional de Biotecnologia (Madrid) 
http://www.cnb.uam.es/ 


Centre d’Étude du Polymorphisme Humain (CEPH) 
(Fondation Jean Dausset) 

http://www.cephb.fr/HomePage.html 

CEPH is a research laboratory created in 1984 by 
Professor Jean Dausset, which constructs maps of 
the human genome. This site contains the CEPH- 
Généthon integrated maps (see below) and the 
CEPH genotype database (see Chapter 5) and the 
CEPH YAC library (see below). 


CEPH-Généthon Integrated Map 

http://www.cephb.fr/ceph-genethon-map.html 
Contains information about the CEPH YAC library, 
contig maps, STS data, Alu-PCR hybridization data, 
fingerprint data, sizing data and FISH data. Primary 
copies of the CEPH YAC library of 33000 clones are 
held at the following centres. 

USA 

• E.S. Lander or T. Hudson, Whitehead Institute/MIT 
Center for Genome Research, Cambridge, 
MA 02142, USA (e-mail: lander@genome.wi.mit.edu) 

Europe 

• D. LePaslier, Fondation Jean Dausset-CEPH, 27 
rue Juliette Dodu, 75010 Paris, France (e-mail: 
denis@ceph.cephb.fr) 

• H. Lehrach, the Reference Library Database 
(RLDB), Max Planck Institute for Molecular 
Genetics, Ihnestrasse 73, 14195 Berlin-Dahlem, 
Germany (tel.: (+49 30) 8413 1627; fax: (+49 30) 8413 
1395) 

• D. Toniolo, GBE, CNR, via Abbiategrasso 207, 
27100 Pavia, Italy (tel.: (+39 382) 546 340; fax: (+39 
382) 422 286) 

• G.J.B. van Ommen, YAC Screening Centre, Leiden 
University, Department of Human Genetics, 
Wassenaarseweg 72, 2333 AL Leiden, the 
Netherlands (tel.: (+31 71) 276081; fax: (+31 71) 276075) 

• K. Gibson, Human Genome Mapping Project 
Resource Centre (HGMP), Hinxton Hall, Hinxton, 
Cambridge CB10 1RQ, UK (tel.: (+44 1223) 494 500; 
fax: (+44 1223) 494 512) 

Japan 

• K. Yokoyama, 3-1-1 Koyadai, Tsukuba, Ibaraki 
305, Japan (tel.: (+81 298) 36 3612; fax: (+81 298) 36 
9120) 

• Y. Nakamura, Human Genome Centre, Institute 
of Medical Science, the University of Tokyo, 4-6-1 
Shirokaneda, Minato, Tokyo 108, Japan (tel.: (+81 3) 
5449 5372; fax: (+81 3) 5449 5433) 

China 

• Z. Chen, Shanghai Institute of Haematology, Rui-Jin 
Hospital, Shanghai Second Medical University, 
Shanghai 200025, China (tel.: (+86 21) 318 0300; fax: 
(+86 21) 474 3206) 


CGC see Caenorhabditis Genetics Center 
CGSC see E. coli Genetic Stock Center 


ChickMap 

http://www.ri.bbsrc.ac.uk/chickmap 

Project to produce an integrated map of the chicken 
genome. 


Children’s Hospital of Philadelphia 
http://www.cbil.upenn.edu/HGC22.html 
Human chromosome 22. 


ChlamyDB 

http://probe.nalusda.gov:8000/plant 

Database on Chlamydomonas reinhardtii including: 
genetic and molecular maps; information on genetic 
loci, mutant alleles, and sequenced genes; descrip- 
tion of strains; contacts; bibliography; information 
on the Chlamydomonas Genetics Center. Curator: 
Elizabeth H. Harris (Duke University) (e-mail: 
chlamy@acpub.duke.edu) 


CHLC see Cooperative Human Linkage Center 


Chromosome 1 see Columbia Linkage Analysis Web 
Server for Chromosome 1 workshop 


Chromosome 2 see Imperial Cancer Research Fund 


Chromosome 3 A second-generation YAC contig 
map. All maps and supporting data tables are 
available by anonymous ftp from ftp://thor.hsc. 
colorado.edu. 
See also Sanger Centre; University of Texas Health 
Science Center 


Chromosome 4 see Sanger Centre; Stanford Human 
Genome Center 


Chromosome 6 see Sanger Centre 


Chromosome 9 see Galton Laboratory 


Chromosome 10 see Genome Therapeutics Corpor- 
ation 


Chromosome 11 see Sanger Centre; University of 
Texas Southwestern Medical Center 


Chromosome 12 see Yale Genome Center 


Chromosome 13 see Columbia University Human 
Genome Project; Sanger Centre 


Chromosome 15 see Baylor College of Medicine 


Chromosome 16 see Los Alamos National Labora- 
tory; Sanger Centre; TIGR 


Chromosome 17 see Baylor College of Medicine 


Chromosome 19 see Lawrence Livermore National 
Laboratory 


Chromosome 21 see Cedars-Sinai Research Institute 
Molecular Genetics Laboratories; Lawrence Berke- 
ley Laboratory 


Chromosome 22 see Caltech Genome Research 
Laboratory; Computational Biology and Informatics 
Laboratory (CBIL) at Pennsylvania University; 
Sanger Centre; University of Oklahoma 


Chromosome X see Baylor College of Medicine; 
Canadian Genome Analysis and Technology Pro- 
gram; Lawrence Livermore National Laboratory; 
Max Planck Institute for Molecular Genetics; Sanger 
Centre 


Chromosome Y see Galton Laboratory 


Cold Spring Harbor Laboratory 

http://www.cshl.org 

Schizosaccharomyces pombe sequencing and Arabi- 
dopsis sequencing. 


Colibri 

A relational database dedicated to the analysis of 
the E. coli genome. Macintosh application available 
via anonymous ftp (ftp.pasteur.fr, in the directory 
pub/GenomeDB/Colibri). For additional informa- 
tion, contact Ivan Moszer (moszer@pasteur.fr) or 
Antoine Danchin (adanchin@pasteur.fr). 


Columbia Linkage Analysis Web Server 
http://linkage.cpmc.columbia.edu/ 
Genetic linkage analysis software. Chromosome 1 
Workshop. Links to other sites. 


Columbia University Human Genome Project 
http://genomel.ccc.columbia.edu/~genome/ 
Human chromosome 13 data (YACs, cosmids, STSs, 
markers). 


CompDB see Vertebrate Comparative Database. 


Computational Biology and Informatics Laboratory 
(CBIL) at Pennsylvania University 

http://cbil-humgen.upenn.edu/ 

Human chromosome 22 data. 


CoolGenes 

http://probe.nalusda.gov:8000/plant/ 

Database (still being developed) for cool season food 
legumes (Pisum, Lens, Cicer, Lathyrus, Vicia faba). 
Curator: Fred Muehlbauer (e-mail: muehlbau@wsu. 
edu). 


Cooperative Human Linkage Center (CHLC) 

http://www.chlc.org/ 

Based at the University of Iowa, the goal of CHLC is 
to develop statistically rigorous, high heterozy- 
gosity genetic maps of the human genome that are 
greatly enriched for the presence of easy-to-use 
PCR-formatted microsatellite markers. Available at 
this site are: genetic maps showing the positions of 
genetic markers; integrated maps showing the 
position of genetic markers constructed using 
genotype data from the CEPH reference panel; 
CHLC marker maps showing the positions of CHLC 
generated markers in various reference maps; 
information on markers. 


CottonDB 

http://probe.nalusda.gov:8000/plant/

A database containing information on Gossypium 
hirsutum and related species. Curators: Gerard Lazo
(e-mail: lazo@tamu.edu) and Sridhar Madhavan
(e-mail: msridhar@tamu.edu).


CpG Island Database 

http://biomaster.uio.no/cpgdb.html

The CpG island database is maintained at the 
Biotechnology Centre of Oslo. It deals with human 
genes appearing in major releases of the EMBL 
nucleotide sequence database but it is hoped that 
in the future it will include islands from other 
mammalian species. 


Danish Centre for Human Genome Research

http://biobase.dk/cgi-bin/celis
Holds human 2-D PAGE databases.


dbEST (Expressed Sequence Tags) 
http://www.ncbi.nlm.nih.gov/dbEST/index.html
A division of GenBank that contains sequence data 
and other information on cDNA sequences charac- 
terized as single reads from DNA sequencing from a 
number of organisms. 


dbSTS (Sequence Tagged Sites) 

http://www.ncbi.nlm.nih.gov/dbSTS/index.html
An NCBI resource that contains sequence and map- 
ping data on short genomic landmark sequences. 


Dendrome Project 
http://probe.nalusda.gov:8000/plant/
A genome database for forest trees. 


DHMHD see Dysmorphic Human-Mouse Homology
Database


DOE Human Genome Program 

http://www.er.doe.gov/production/oher/hug_top.html

Information on the DOE projects; links to Los 
Alamos, Lawrence Berkeley, Lawrence Livermore 
and other laboratories involved in the Human 
Genome Project; a primer on the basic science of the 
Human Genome Project; project resources and 
meetings. 


DOE Microbial Genomes Initiative 

http://www.er.doe.gov/production/oher/mig_top.html

Projects to sequence a variety of microbial genomes. 
See TIGR. 


Dog Genome Project 
http://mendel.berkeley.edu/dog.html


DogMap 

http://ubeclu.unibe.ch/itz/markma.html

A low-resolution map of the canine genome being 
constructed by an international collaboration under 
the auspices of the International Society for Animal 
Genetics. Information from Gaudenz Dolf, Institute 
of Animal Breeding, University of Berne, Brem- 
gartenstrasse 109a, 3012 Berne, Switzerland (e-mail: 
gauden@itz.unibe.ch). 


Drosophila database
http://www-leland.stanford.edu/~ger/drosphila.html


Dysmorphic Human-Mouse Homology Database
(DHMHD) 

http://www.hgmp.mrc.ac.uk/DHMHD/dysmorph.html

Three separate databases of human and mouse 
malformation syndromes together with a database 
of mouse/human syntenic regions. The mouse and 
human malformation databases are linked together 
through the chromosome synteny database. The 
purpose of the system is to allow retrieval of 
syndromes according to detailed phenotypic de- 
scriptions and to be able to carry out homology 
searches for the purpose of gene mapping. 


EBI see European Bioinformatics Institute 


E. coli clones 

Dr Y Kohara, National Institute of Genetics, 
Mishima, Shizuoka-ken 1111, Japan (fax: (+81 559) 81
6826).


E. coli Genetic Stock Center (CGSC) 
http://cgsc.biology.yale.edu/top.html 

The E. coli Genetic Stock Center (CGSC) at the 
Department of Biology, 355 OML, Yale University, PO
Box 208104, New Haven CT 06520-8104, USA (fax:
(+1 203) 432-3852) maintains a database of E. coli
genetic information, including genotypes and 
reference information for the several thousand 
strains in the CGSC collection, a gene list with map 
and gene product information, and information on 
specific mutations. For information contact Mary 
Berlyn (e-mail: mary@cgsc.biology.yale.edu). 


ECACC see European Collection of Animal Cell 
Cultures 


E. coli genome project 
http://ecoliftp.genetics.wisc.edu


ECDC: Escherichia coli database collection 
http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html

Contains information for the entire E. coli K-12
chromosome, and is organized like a genetic map.
The database can be searched for gene names or map
positions. Coding sequences are indicated for each
gene. Regulatory regions, promoters, terminators,
and IS elements are also indicated. The complete
ECDC dataset is available by anonymous ftp
(susi.bio.uni-giessen.de) or together with a Windows
application on the EMBL (EBI) CD-ROM. For
information contact Manfred Kroger (e-mail:
kroeger@embl-heidelberg.de).


ECO2DBASE: E. coli gene-protein database
Contains information about E. coli proteins obtained
by the analysis of two-dimensional protein gels, and
is maintained by F.C. Neidhardt. ‘Ed6.0195’ is the
database file (text format) for the sixth published
version of the database. Updates will have a
different extension after the decimal. Available by
anonymous ftp from
ftp://ncbi.nlm.nih.gov/repository/ECO2DBASE. Questions
and comments can be sent to Ruth VanBogelen
(e-mail: vanbogr@aa.wl.com) or Fred Neidhardt
(e-mail: fcneid@umich.edu).


EcoCyc: Encyclopedia of E. coli Genes and
Metabolism

http://www.ai.sri.com/ecocyc/ecocyc.html

A database integrating information about E. coli 
genes and metabolism. A graphical user interface 
creates drawings of metabolic pathways, of indivi- 
dual reactions, and of the E. coli genomic map. Users 
can call up objects through a variety of queries and 
then navigate to related objects shown in the display 
window. 


EcoSeq, EcoMap, EcoGene 

EcoSeq is a nonoverlapping E. coli DNA sequence 
collection which integrates information about genes, 
DNA and protein sequences. EcoMap integrates 
EcoSeq with a genomic restriction map. EcoGene 
contains information about identified and putative 
protein- and RNA-encoding genes, and translations 
of sequences thought to encode proteins. These data 
are correlated and cross-referenced with the SWISS- 
PROT protein sequence database. Available by 
anonymous ftp from
ftp://ncbi.nlm.nih.gov/repository/Eco/

For additional information, contact Kenn Rudd 
(e-mail: rudd@ncbi.nlm.nih.gov). 


EGAD (Expressed Gene Anatomy Database) see 
TIGR 


EHCB (European Human Cell Bank) see European 
Collection of Animal Cell Cultures 


EMBL see European Molecular Biology Laboratory 


EMBL Nucleotide Sequence Database see European 
Bioinformatics Institute 


ENZYME see Geneva University 


European Bioinformatics Institute (EMBL/EBI)
http://www.ebi.ac.uk/

An outstation of the European Molecular Biology 
Laboratory. It holds the EMBL Nucleotide Sequence 
Database; SWISS-PROT protein sequence database;
dbEST; dbSTS; Radiation Hybrid Database; IMGT
immunogenetics database; the PCR primers
database; and FlyBase. Also documentation and
software. 


European Collaborative Interspecific Backcross 
(EUCIB) 

http://www.hgmp.mrc.ac.uk/MBx/MBxHomepage.html

The latest EUCIB high-resolution mouse microsatel- 
lite maps and mapping data. 


European Collection of Animal Cell Cultures 
(ECACC) 

http://www.gdb.org/annex/ecacc/HTML/geninf.html


European Human Cell Bank (EHCB) see European 
Collection of Animal Cell Cultures 

Specialist collection of cell lines derived from 
patients with genetic disorders or chromosome 
abnormalities. 


European Molecular Biology Laboratory (EMBL) 
http://www.embl-heidelberg.de/


ExPASy see Geneva University 


FishNet 

http://zfish.oregon.edu
http://www-igbmc.u-strasbg.fr/index.html
Information on zebrafish genome projects and links 
to other sites. 


FlyBase 

http://morgan.harvard.edu:80/

http://flybase.bio.indiana.edu:82/

http://www.embl-ebi.ac.uk/flybase

http://www.angis.su.oz.au:7081/

http://shigen.lab.nig.ac.jp:7081/

A database containing information on the genetics 
and biology of Drosophila species. FlyBase contains 
the text of the Lindsley and Zimm ‘Red Book’ (only 
in the US copy, owing to copyright reasons), lists 
of chromosome aberrations (sorted by class and 
cytological breakpoints), molecular clones, the 
genetic map, and stock lists of the international 
Drosophila stock centres. Using Gopher, these files 
can be interactively searched. 


Galton Laboratory 

http://diamond.gene.ucl.ac.uk

Chromosome 9 Workshop reports, maps, contact 
addresses. Chromosome Y fingerprint data. Linkage 
software. 


GCG see Genetics Computer Group
GDB see Genome Database 


GenBank see National Center for Biotechnology
Information 


Gene family database 
http://gdbdoc.gdb.org/~avolz/home.html


Gene Knockouts Database 

http://www.bayanet.com/bioscience/knockout/knochome.htm

Data on phenotypes obtained by the knockout of 
various molecules in mice. 

See also Appendix VIII in this book. 


Genestream at EERIE 

http://genome.eerie.fr/Genome.html

The Southern France Human Genome Project 
Computing Resource Centre. 


Généthon 
http://www.genethon.fr/
See also CEPH. 


Genetic Location Database (LDB) 

http://cedar.genetics.soton.ac.uk/public_html/
An analytical database held at the University of 
Southampton, UK for constructing fully integrated 
genetic and physical maps (see Chapters 3 and 16). 
The ldb program generates an integrated map
(known as the summary map) from partial maps of 
physical, genetic, regional, somatic hybrid, mouse 
homology and cytogenetic data. The summary maps 
and the data used to build up such maps are 
available from the WWW site. The files for each 
chromosome are stored in the same directory that 
includes the summary map, partial maps, lod files 
and the parameter files. Alternatively, the ldb 
program can be downloaded and used to create 
the user’s own integrated maps. Submissions to 
arc@southampton.ac.uk.


Genetics Computer Group (GCG) 
http://www.gcg.com

Commercial suppliers of the Wisconsin Package™ 
for sequence analysis. 


Geneva University (ExPASy) 

http://expasy.hcuge.ch/

The molecular biology server of the Geneva 
University Hospital and the University of Geneva, 
which is dedicated to the analysis of protein and 
nucleic acid sequences and 2-D PAGE. You can
search the databases SWISS-PROT (annotated
protein sequence database); PROSITE (a dictionary 
of protein sites and patterns); SWISS-2DPAGE (two- 
dimensional polyacrylamide gel electrophoresis 
database); SWISS-3DIMAGE (3D images of proteins 
and other biological macromolecules); ENZYME 
(the Enzyme Nomenclature Database); and 
SeqAnalRef (a sequence analysis bibliographic 
reference database). The site also contains tools and 
software packages. 


GenoBase see NIH GenoBase server 


Genome Database (GDB) 

http://gdbwww.gdb.org/ 

The main repository for human genetic data. It 
contains entries for individual loci which are linked 
to other databases such as MGD, FlyBase and the 
Enzyme Nomenclature Database at SWISS-PROT 
so that homologies with other mammalian and 
Drosophila genes, and enzyme function data can also 
be obtained. The database also contains information 
on polymorphisms, some 70 maps (linkage, cyto- 
genetic, radiation hybrid and the latest Généthon 
map), mutations, probes, clone libraries, cell lines, 
citations, and contact addresses. 


Genome Therapeutics Corporation (GTC) 

http://wwwa.cric.com/

Commercial organization. Chromosome 10 physical 
mapping data. 


GenomeNet, Japan 

http://www.genome.ad.jp/ 

Provides access to other databases and sequence 
interpretation tools. Databases include BSORF, the
Bacillus subtilis database held at the University of
Tokyo; the Escherichia coli Databank; CyanoBase, a
database for Synechocystis spp.; BodyMap, an
expression database of human genes; SPAD, a
signaling pathway database; and the Aberrant
Splicing Database. There are links to many of the
main genomic and sequence databases (e.g. GDB, 
EMBL). It also holds the Kyoto Encyclopedia of 
Genes and Genomes, and software for sequence 
interpretation. 


GenProtEc: E. coli Gene Database
http://www.mbl.edu/~dspace/eco.html

A compilation of E. coli genes and gene products, 
categorized by physiological function, this database 
also includes homology information for proteins 
similar to at least one other E. coli protein. Available 
by ftp://hoh.mbl.edu/pub/ecoli.zip


George M. Church Laboratory 
http://twod.med.harvard.edu
Software for DNA sequence analysis. 


GOBASE see Canadian Genome Analysis and 
Technology Program 


GSDB see National Center for Genome Resources
(NCGR) 


Harvard Biological Laboratories 
http://golgi.harvard.edu

Extensive links to other sites. CGC software docu- 
mentation. Genome databases for Arabidopsis, C. 
elegans, Drosophila, human, mouse, prokaryote, and 
yeast. Searches of sequence databases, Entrez, 
culture collections and REBASE, and information on 
internet resources and searching for information 
across the networks. 


HGP: Human Genome Project at Oak Ridge 
National Laboratories 
http://www.ornl.gov/TechResources/Human_Genome/home.html

Links to all genome centres participating in the US 
Human Genome Project. 


Human Genome Program 
http://www.er.doe.gov/production/oher/hug_top.html

The human genome programme of the US 
Department of Energy. 


Human population genetics database (Geno- 
graphy) 

http://lotka.stanford.edu/genography.html
Database and information on human population 
genetics. 


ICRF see Imperial Cancer Research Fund 


I.M.A.G.E. Consortium (Integrated Molecular 
Analysis of Genomes and their Expression) 
http://www-bio.llnl.gov/bbrp/image/image.html
Information on and availability of more than 200000 
arrayed human cDNA clones. 


Imperial Cancer Research Fund (ICRF) 

http://www.icnet.uk

Chromosome 2 mapping information. EUROGEM 
Project information. ICRF contig-building package 
from ftp.icnet.uk/icrf-public/GenomeAnalysis/icrf_contig_v2.tar.Z


Indiana University
http://ftp.bio.indiana.edu


InfoBioGen 

http://www.infobiogen.fr/

Computing and information resource for the French 
molecular biology and genome projects. Mainly in 
French. 


InfoBiotech Canada 

http://www.ibc.nrc.ca/ibc

Information on biotechnology in Canada and 
worldwide. Contains several databases and links 
to numerous other sites. Internet-wide searches
possible. 


INRA (Institut National de la Recherche Agrono- 
mique) Biotechnology Laboratories 
http://locus.jouy.inra.fr


Institut Pasteur 
http://www.pasteur.fr/
Links to many databases and useful sites. 


Integrated Genomic Database (IGD) 
Moulon server http://moulon.inra.fr:8001/acedb/igd.html


Japan Animal Genome Databases 

http://ws4.niai.affrc.go.jp/jgbase.html

Search genome databases on pig (cytogenetic map, 
USDA linkage map, PIGM linkage map); cattle 
(cytogenetic map, USDA linkage map); chicken 
(cytogenetic map; linkage map); horse (cytogenetic 
map). 


Japanese Rice Genome Program (RGP) 
http://www.staff.or.jp
Information service on rice genome research. 


Johns Hopkins Bio-informatics WWW Server 

http://www.gdb.org/hopkins.html

Holds protein databases including PIR (the Protein
Identification Resource, a protein sequence database)
and REBASE (the restriction enzyme database).
Holds TBASE and the DOE Human Subjects 
Database. 


Johns Hopkins GDB WWW Server 
http://gdbwww.gdb.org 

Holds information about GDB and its future 
developments. Holds a GDB browser; OMIM; 
ideogram-based searching of GDB; maps of HUGO 
reference markers. 


Kabat Database of Proteins of Immunological 
Interest 


http://immuno.bme.nwu.edu/


La Trobe University Comparative Genome Mapping 
Page 

http://www.latrobe.edu.au/www/genetics/compmap.html

CompMap: clickable human chromosome maps
with corresponding loci from a wide range of 
species. Links to other sites. 


LANL see Los Alamos National Laboratory 


Lawrence Berkeley Laboratory (LBL) Human 
Genome Center 

http://www-hgc.lbl.gov/GenomeHome.html
Human Chromosome 21 P1 and cDNA mapping 
data bases. Drosophila physical mapping. Instrumen- 
tation and informatics projects. Human chromo- 
some and directed genome sequencing projects. 


Lawrence Livermore National Laboratory (LLNL) 
http://www-bio.llnl.gov/bbrp/genome.html
Biology and Biotechnology Research Program
http://www-bio.llnl.gov/bbrp/bbrp.homepage.html

Human Genome Center. Physical maps of human 
Chromosome 19. Closure of the chromosome 19 
map. Enhancement of the high resolution clone map 
of human chromosome X. DNA Sequencing. 
National Laboratory Gene Library Project. Alu 
Repeats: a novel source of genetic variation for 
mapping. Informatics and analytical genomics.
Instrumentation for the Human Genome Project. 
I.M.A.G.E. Consortium home page. 


LBL see Lawrence Berkeley Laboratory Human 
Genome Center 


LDB see Genetic Location Database 
LLNL see Lawrence Livermore National Laboratory 


Los Alamos National Laboratory (LANL) 
http://www-t10.lanl.gov/ 

Sigma chromosome maps. Chromosome 16 flat file. 
HIV databases. 


MaizeDB: Maize Genome Database 

http://www.agron.missouri.edu

Curated by the USDA Plant Genetics Unit located 
within the College of Agriculture of the University 
of Missouri-Columbia. Contains genetic maps, 
mapped loci, recombination and map score data,
probe data, genetic/cytogenetic stocks, locus
variations, references, contact addresses of maize 
researchers. 


Max Planck Institute for Molecular Genetics, Berlin 
http://www.mpimg-berlin-dahlem.mpg.de/~xteam
Chromosome X. 


MBx database see European Collaborative Inter- 
specific Backcross 


Meat Animal Research Center (MARC) 
http://sol.marc.usda.gov
Pig and cattle genome data and maps. 


Mendel 

http://probe.nalusda.gov:8000/plant

A database of designations for plant-wide families 
of sequenced plant genes and designations for 
sequenced genes in individual plant species. 


MGD see Microbial Database; Mouse Genome 
Database 


MICADO see Microbial Advanced Database
Organization


Microbial Advanced Database Organization
(MICADO)
http://locus.jouy.inra.fr 
Bacillus subtilis and E. coli databases. 


Microbial Database (MGD) 

http://www.tigr.org/

Sequence of the Haemophilus influenzae, Mycoplasma
genitalium, and other bacterial genomes. 


MITOMAP 

http://www.gen.emory.edu/mitomap.html

A mitochondrial DNA database held at Emory 
University, Atlanta, which contains information on 
the human mitochondrial genome. 


Mosquito Genomics WWW Server 
http://klab.agsci.colostate.edu/

Holds the Aedes aegypti linkage and physical maps 
and AaeDB database. Holds MsqDB, a database 
of information across mosquito species. Holds 
FlyBase. Link to AnoDB. 


Motif BioInformatics WWW Server 
http://motif.stanford.edu 
Extensive links to other sites and databases. 


Moulon WWW server 
http://moulon.inra.fr:8001 

Holds the C. elegans database (ACeDB), the 
Integrated Genome Database (IGD), a metabolic 
database, and information on setting up ACeDB- 
style databases on the WWW. 


Mouse Genome Database and the Encyclopedia of the 
Mouse Genome 

http://www.informatics.jax.org/mgd.html
http://mgd.hgmp.mrc.ac.uk/mgd.html 

Mouse genetic mapping information from centre 
programmes, collaborative programmes and single 
laboratory efforts worldwide is regularly trans- 
ferred to the Mouse Genome Database (MGD) at 
the Jackson Laboratory (Bar Harbor, Maine, USA) 
and presented in the latest issue of the Encyclopedia of
the Mouse Genome, a tool for the presentation of
mouse genome and related information. MGD 
contains mouse locus information; genetic mapping 
data; mammalian homology data; probes, clones 
and PCR primers; genetic polymorphisms; the 
Mouse Locus Catalogue (gene descriptions) and 
characteristics of inbred strains. 


Mouse Locus Catalog (MLC) see Mouse Genome 
Database (MGD) 


mousedb 
http://www.hgmp.mrc.ac.uk

The Harwell mouse database compiled from 
published information, includes man-mouse homo- 
logies, mouse gene list and data from the Mouse 
Chromosome Atlas maps. 


MsqDB 
http://klab.agsci.colostate.edu 

Genetic and physical chromosome mapping data 
across mosquito species. 


MycDB see Mycobacterium Database 


Mycobacterium Database (MycDB) 
http://www.biochem.kth.se/MycDB.html

An integrated mycobacterial database containing 
data on physical and genetic mapping, and 
nucleotide sequences of mycobacterial genomes. 
Curators: Staffan Bergh, Royal Institute of 
Technology, Stockholm (e-mail: staffan@biochem. 
kth.se) and Stewart Cole, Pasteur Institute, Paris 
(e-mail: stcole@pasteur.fr). 


Mycobacterium tuberculosis 
http://www.sanger.ac.uk/pathogens/
Cosmid sequences of the M. tuberculosis genome.


National Center for Biotechnology Information
(NCBI) 

http://www.ncbi.nlm.nih.gov/

Responsible for building, maintaining and dis- 
tributing the DNA sequence database GenBank. 


National Center for Genome Resources (NCGR) 
http://www.ncgr.org/

Holds GSDB, a genome sequence database with 
links to GDB, SWISS-PROT, etc., the SIGMA system 
for integrated genome map assembly. Information 
on ethical, legal and social implications of bio- 
technology. 


National Center for Human Genome Research 
(NCHGR) 

http://www.nchgr.nih.gov/

The centre heading the Human Genome Project for 
the National Institutes of Health (NIH) in the United 
States. Information on projects underway, including 
clinical gene therapy; diagnostic development; 
education and training; genetic resources; Laboratory
of Cancer Genetics; Laboratory of Gene Transfer;
Laboratory of Genetic Disease Research; medical
genetics; technology transfer. 


National Institutes of Health 
http://www.nih.gov/ 


National Library of Medicine 
http://www.nlm.nih.gov/ 


NIGMS (National Institute of General Medical 
Sciences) Human Genetic Mutant Cell Repository 
http://arginine.umdnj.edu/coriell/nigms.html
Human cell cultures are available in the following 
categories: inherited disorders with characterized 
mutation; well-characterized chromosomally aber- 
rant cell cultures; CEPH Reference Families; a 
human diversity collection; and human-rodent
somatic cell hybrid mapping panels. 


NIH GenoBase Server 

http://dcrt.nih.gov:8004

Molecular biology database which incorporates and 
links the contents of several large sets of data 
including the EMBL sequence database and SWISS- 
PROT. The server also holds data from the Myco- 
plasma capricolum genome project. Text and BLAST 
searches of GenBank possible. 


NRSub see Bacillus subtilis Database 


Oak Ridge National Laboratory 
http://www.ornl.gov
Links to all genome centres participating in the US
Human Genome Project. 

http://avalon.epm.ornl.gov

Informatics Group. Software for sequence analysis. 


Online Mendelian Inheritance in Animals (OMIA) 
http://www.angis.su.oz.au/BIRX/omia/omia_form.html


Online Mendelian Inheritance in Man (OMIM) 
http://gdbwww.gdb.org/omim/docs/omimtop.html

The on-line version of Mendelian Inheritance in Man. 
Selected tables of mapped human disease genes are 
reproduced with kind permission in Appendix VII 
of this book. 


Organelle Genome Database (GOBASE) see Canadian
Genome Analysis and Technology Program
(CGAT)


Organelle Genome Megasequencing Program
(OGMP)

http://megasun.bch.umontreal.ca/ogmpproj.html
Organelle information and sequence databases.
Protist image databases.


Oslo Biotechnology Centre 
http://bioslave.uio.no:8001


PathoGenes 

http://probe.nalusda.gov:8000/plant

A database about fungal pathogens of small-grain 
cereals. 


Philadelphia Genome Center (University of Penn- 
sylvania Computational Biology and Informatics 
Laboratory) 

http://www.cbil.upenn.edu/HGC22.html

Human chromosome 22 sequencing and mapping 
information. 

http://www.cbil.upenn.edu/~sdong/genlang_home.html

Sequence analysis software. 


PigMap 

http://www.ri.bbsrc.ac.uk/pigmap

Information on the collaborative project to construct 
a genetic linkage map of the pig genome, and 
database of pig genome information. 


PIR (the Protein Identification Resource, a protein
sequence database) see Johns Hopkins Bio-
informatics WWW Server; Geneva University; UK
Human Genome Mapping Project Resource Centre


PomBase (Schizosaccharomyces pombe database) see
Sanger Centre 


Prodom (Protein Domain Database) see Sanger 
Centre 


PROSITE see Geneva University 


Radiation Hybrid Database (RHdb)

http://www.ebi.ac.uk/RHdb/

An archive of raw data with links to other related
databases.


RATMAP 

http://ratmap.gen.gu.se/

Database covering genes physically mapped to 
chromosomes in the laboratory rat and kept by the 
Department of Genetics in Gothenburg. The service
contains general information, rat genetic nomen- 
clature, rat locus list, and literature references sorted 
by rat gene symbol. 


REBASE 
http://www.gdb.org/Dan/rebase/rebase.html
Database of restriction enzymes. 


Reference Library Database (RLDB) and Reference 
Library System 
http://rzpd.rz-berlin.mpg.de/RLDB/ 

Database of hybridization and PCR work on filters 
of cosmid, YAC, P1 and cDNA libraries. Lists of 
probes/clones freely available from the Reference 
Library. 


REPBASE 

Contains prototypical interspersed repetitive ele- 
ments from primates, rodents, mammals, verte- 
brates, invertebrates, and plants, as well as a 
collection of prototypical simple DNA sequences in 
primates. The database also contains collections of
occurrences of Alu, L1, MIR, and THE repetitive
elements. Available via anonymous ftp to
ncbi.nlm.nih.gov, in the directory ‘repository/repbase’.


Resourcen Zentrum Max-Planck-Institut für
Molekulare Genetik
http://rldb.rz-berlin.mpg.de/main_e.html 
Arrayed cosmid, YAC, P1 and cDNA libraries of 
human chromosome-specific, Drosophila, and Schizo- 
saccharomyces pombe clones. 


Ribosomal DNA see Sanger Centre 


RiceGenes 
http://probe.nalusda.gov:8000/plant/
A database for the rice genome. Curator: Edie Paul,
Cornell University (e-mail:
epaul@nightshade.cit.cornell.edu).


Roslin Institute, Edinburgh 

http://www.ri.bbsrc.ac.uk/homepage.html

Holds BovMaP, ChickMaP, PigMaP, SheepBase 
genome databases; also a wide range of sequence 
databases and tools, and lists of other genome 
databases. Will hold the comparative mapping 
database TCAGdb. 


Saccharomyces Genome Database (SGD) 
http://genome-www.stanford.edu


Saccharomyces Genome Project 
http://www.mips.biochem.mpg.de/ 

http://www.embl-ebi.ac.uk
http://www.sanger.ac.uk/yeast/home.html
http://genome-www.stanford.edu (Saccharomyces
Genome Database)
http://www.nig.ac.jp
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/XREFdb
http://quest7.proteome.com/YPDhome.html (the
Yeast Protein Database)
http://expasy.hcuge.ch/cgi-bin/list?yeast.txt


Sanger Centre 

http://www.sanger.ac.uk/

Information on the C. elegans genome sequencing 
project and access to ACeDB and its derivative
databases (Wormpep: predicted proteins from the C.
elegans project; Prodom: Protein Domain Database).
Information on the yeast genetics project at the 
centre and sequencing projects on Mycobacterium 
tuberculosis, Saccharomyces cerevisiae, Schizosaccharo- 
myces pombe (PomBase), and human chromosomes. 
Software for DNA sequencing and sequence 
analysis. 


Schistosoma Genome Project 
http://www.nhm.ac.uk/schistosome 


Schizosaccharomyces pombe (NIH fission yeast
information)
http://www.nih.gov/sigs/yeast/fission.html


SheepBase 

http://dirk.invermay.cri.nz 

An up-to-date compilation of published data from 
sheep genome mapping projects. Compiled by the 
New Zealand Sheep Genome Programme. 


SolGenes 
http://probe.nalusda.gov:8000/plant/
A genome database containing information about
potatoes, tomatoes, and peppers. Curator: Clare
Nelson, Cornell (e-mail:
cnelson@nightshade.cit.cornell.edu).


SorghumDB 

http://probe.nalusda.gov:8000/plant/

A genome database (still under development) for 
sorghum. Curator: Najeeb Siddiqi, Texas A&M 
University (e-mail: nus6389@tam2000.tamu.edu). 


Southern France Human Genome Project Com- 
puting Resource Centre see Genestream at EERIE 


Stanford Human Genome Center 

http://shgc.stanford.edu

Chromosome 4. Generation of STSs throughout 
human genome. Construction of radiation maps of 
human genome. 


Stanford University DNA Sequence and Technology 
Center 

http://genome-www.stanford.edu/SDSATC/staff.html

Development of high throughput sequencing 
methodology. 


Swiss Federal Institute of Technology, Zurich see 
CBRG at ETHZ 


SWISS-2DPAGE see Geneva University 
SWISS-3DIMAGE see Geneva University
SWISS-PROT see Geneva University 


TBASE: the Transgenic/Targeted Mutation Data Base 
http://www.gdb.org/Dan/tbase/tbase.html 
Information on transgenic animals and targeted 
mutations. Covers mouse, rat, pig, and Drosophila. 


TIGR (the Institute for Genomic Research) 
http://www.tigr.org 

The Microbial Data Base (MDB) provides access 
to the genome sequences for the Haemophilus
influenzae, Mycoplasma genitalium and Methanococcus
jannaschii genomes. The Human cDNA Database (HCD)
provides researchers at non-profit institutions
access to cDNA/EST sequence and related data. The
Expressed Gene Anatomy Database (EGAD) links 
expression data, cellular roles, and alternative splic- 
ing information to a curated, non-redundant set of 
human transcript sequences and their function, 
cellular role, tissue distribution. The Sequences, 
Sources, Taxa database (SST) provides links between
source, collection, taxonomy, and molecular sequence
data. Also human chromosome 16 sequencing.
Software tools available to academic researchers on
request by e-mail to tools@tdb.tigr.org.


Tokyo University Insect Group 

http://www.ab.a.u-tokyo.ac.jp/sericulture/shimada.html

Genetic maps of the silkmoth Bombyx mori. 


TreeGenes 

http://probe.nalusda.gov:8000/plant/

A genome database (still in development) for forest 
trees, part of the Dendrome Project. 


Tumour Gene Database 
http://condor.bcm.tmc.edu/oncogene.html


UK Human Genome Mapping Project Resource 
Centre (HGMP-RC) 

http://www.hgmp.mrc.ac.uk 

Access to many genomic databases and resources 
and extensive links to other sites. Software. 


UniGene: Unique Human Gene Sequence Collection 
http://www.ncbi.nlm.nih.gov/Schuler/UniGene
Database holding clusters of human EST sequences 
that represent the transcription products of distinct 
genes. 


University of California, Berkeley 
http://fruitfly.berkeley.edu
Drosophila Genome Center. 


University of California, Davis 
http://www.vgl.ucdavis.edu/~lvmillion 
Horse genetics. 


University of Michigan Human Genome Center 
http://www.hgp.med.umich.edu/ 

DNA sequencing resources. Whitehead/MIT mouse
map data. CEPH-Généthon physical map data. 
Links to many other sites. 

http://www.sph.umich.edu/group/statgen/software

Radiation hybrid mapping software. 


University of Minnesota Medical School, Computa- 
tional Biology Center 

http://www.cbc.med.umn.edu/

Info on sequence analysis projects on Arabidopsis, 
maize, rice, loblolly pine. Sequence analysis soft- 
ware. Candida albicans molecular biology. 

http://lenti.med.umn.edu/zebrafish/zfish_top_page.html
Zebrafish sequence analysis project.

http://lenti.med.umn.edu/MolBio-man/ 
Unofficial guide to GCG (‘Wisconsin’) software 
package. 


University of Texas Health Science Center 
http://mars.uthscsa.edu/
Human chromosome 3. 


University of Texas Southwestern Medical Center 
http://mcdermott.swmed.edu 

Human chromosome 11. 

http://eatworms.swmed.edu

Caenorhabditis elegans WWW server. 


University of Utah 
http://www-genetics.med.utah.edu 
Resources for genome sequencing. 


University of Wisconsin E. coli Genome Center 
http://ecoliftp.genetics.wisc.edu
Sequencing of the E. coli genome. 


USDA Agricultural Plant Genomes/USDA/ARS/
NAL Plant Genome Data
http://probe.nalusda.gov:8000/plant/index.html
Plant DNA libraries. Information on plant genome 
mapping projects. Contains molecular and pheno- 
typic information about the genomes of Arabidopsis, 
alfalfa, Phaseolus, Vigna, Chlamydomonas reinhardtii, 
legumes, cotton, cereals, maize, fungal pathogens 
of small-grain cereals, rice, Solanaceae, sorghum, 
soybeans, forest trees. 


US Human Genome Project: DOE and NIH Human 
Genome Research Sites 
http://www.ornl.gov/TechResources/Human_Genome/CENTERS.HTML

Lists participating centres in the US Human 
Genome Project with contact addresses and in- 
formation on projects. 


V BASE 

http://www.mrc-cpe.cam.ac.uk/imt-doc/vbase-home-page.html

A directory of human immunoglobulin V genes 
compiled from published sources. 


Vertebrate Comparative Database (CompDB) 
http://www.hgmp.mrc.ac.uk/Comparative/home.html

Contains homologue data between human genes 
and a range of other vertebrate species. 


Virtual Genome Center 

http://alces.med.umn.edu/VGC.html

Contains sequence analysis tools: query GenBank, 
SWISS-PROT databases. Useful databases and 
tables: codons; size of human chromosomes, human 
repeated DNA. Sequences of the S. cerevisiae chro- 
mosomes. Candida albicans: physical map, sequence 
data, strains, resources. 


Walter and Eliza Hall Institute of Medical Research 
(WEHI) 

http://www.wehi.edu.au

GCG programs and manuals. MHCPEP database. 
SRS malaria database. Graphics interface to the 
Brookhaven Protein Data Bank. GDB. 


Washington University School of Medicine 
http://genome.wustl.edu/ 
Human X chromosome. 


Washington University School of Medicine Genome 
Sequencing Center 

http://genome.wustl.edu/gsc/gschmpg.html
Caenorhabditis elegans sequencing. EST sequencing. 


Washington University, Department of Pathology 
http://www.pathology.washington.edu

Human and mouse chromosome ideograms. Horse 
idiographic karyotype. Cytogenetic gallery of 
scanned photomicrographs of abnormal human 
karyotypes. Scanned images of human and mouse 
chromosome spreads. 


Weizmann Institute of Bioinformatics 
http://dapsas1.weizmann.ac.il 

Design and development of tools for Bioinfor- 
matics, especially in the areas of molecular biology 
and the human genome. Holds a list of genome and 
molecular biology sites. 


Whitehead Institute Mouse Genetic Map Infor- 
mation 

http://www-genome.wi.mit.edu/genome_data/mouse/mouse_index.html

Data representing the Whitehead Institute / MIT 
Center for Genome Research mouse genetic map. 


Whitehead Institute for Medical Research/MIT 
Center for Genome Research 
http://www-genome.wi.mit.edu/

Whitehead Institute STS/YAC Map. Links to other 
sites. Other resources. 


XREFdb

http://www.ncbi.nlm.nih.gov/XREFdb/

Database cross-referencing the genetics of model 
organisms with mammalian phenotypes. Provides 
similarity search, mapping and relevant mammalian 
phenotype information, and also BLAST similarity 
search results that identify significant matches 
between sequences of model organism proteins and 
mammalian peptide sequences predicted by con- 
ceptual translation of ESTs. Originally funded by 
the NCHGR to cross-reference the Saccharomyces 
cerevisiae and mammalian genomes, XREFdb has 
recently been expanded to accept protein queries 
from other model organisms including Caenorhab- 
ditis elegans, Drosophila melanogaster, Escherichia coli, 
Mus musculus, Rattus norvegicus, Schizosaccharomyces
pombe and Xenopus laevis. Information: info@gmail.bs.jhu.edu
or basset@ncbi.nlm.nih.gov


Yale Genome Center 
http://paella.med.yale.edu
Human chromosome 12. 


Yeast Protein Database (YPD) 
http://www.proteome.com/YPDhome.html
Database of Saccharomyces cerevisiae proteins. 


Zebrafish Sequence Analysis Project 

http://lenti.med.umn.edu/zebrafish/zfish_top_page.html

Zebrafish project at the University of Minnesota. 



Appendix I Materials, media and
solutions

acrylamide stock solution 40% (w/v) (acrylamide/bis-
acrylamide, 37.5 : 1)

Dissolve 100 g acrylamide and 2.7 g bis-acrylamide in H₂O
to a final volume of 250 ml. Store in dark glass bottles at
4°C.

Acrylamide is toxic and is absorbed through the skin. Always 
wear gloves and work in the hood. 


acrylamide solution 40% (w/v) for gradient gels for 
DNA sequencing (acrylamide/bis-acrylamide, 37.5 : 1) 
Dissolve 380 g acrylamide and 20 g bis-acrylamide in H₂O
to a final volume of 1 litre. Add 20 g mixed bed resin (e.g. 
Amberlite MB-1 or equivalent) and stir carefully for 20 
min (this step removes metal ions and acrylic acid). Store 
in dark glass bottles at 4°C. 

Always wear gloves and work in the hood. 


acrylamide stock solution 6% (w/v) (0% denaturant 
stock solution) in TAE buffer 

For 500 ml: 75 ml acrylamide (40% stock), 25 ml TAE (20 x
stock), and H₂O up to 500 ml.


alkaline lysis DNA miniprep solutions 

Alkaline lysis I solution: 50 mM glucose, 25 mM Tris-HCl
pH 8.0, 10 mM EDTA. Alkaline lysis II solution: 0.2 M
sodium hydroxide, 1% SDS. Alkaline lysis III solution: 3 M
potassium acetate, 2 M acetic acid.


ammonium persulphate stock solution (10% w/v) 
Dissolve 10 g ammonium persulphate in H₂O to a final
volume of 100 ml. This solution is usually freshly prepared
but may also be stored in small aliquots (1 ml) at -20°C.


beta-mercaptoethanol (β-ME)

The stock solution of β-ME is 14.4 M. To prepare a 1 M
stock solution, for example, take 1 ml β-ME (14.4 M) and
13.4 ml double-distilled H₂O. For a final concentration of
5 x 10° M it may be preferable to make a 1 : 100 dilution and
use this to make the final stock solution. Filter the 5 x 10°
M β-ME, dispense into 5-ml aliquots and store at 4 °C.
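
The 1 ml plus 13.4 ml figure follows from C1 x V1 = C2 x V2. A minimal Python sketch of that arithmetic is given here for checking other dilutions; the function name `diluent_volume` is illustrative and not part of the protocol, and volumes are assumed to be additive.

```python
def diluent_volume(stock_conc, target_conc, stock_volume):
    """Volume of diluent to add to `stock_volume` of stock so that the
    concentration falls from `stock_conc` to `target_conc`.

    Uses C1 * V1 = C2 * V2 and assumes volumes are additive.
    Concentrations share one unit; the result has the unit of `stock_volume`.
    """
    if not 0 < target_conc <= stock_conc:
        raise ValueError("target must be positive and not exceed the stock")
    final_volume = stock_conc * stock_volume / target_conc
    return final_volume - stock_volume


# 1 ml of 14.4 M beta-mercaptoethanol diluted to 1 M
water_ml = diluent_volume(14.4, 1.0, 1.0)
print(f"add {water_ml:.1f} ml double-distilled water")
```

The same function covers the intermediate 1 : 100 dilution: `diluent_volume(100, 1, 1)` gives the 99 volumes of water to add to 1 volume of stock.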


buffers for reverse phase HPLC 
Buffer A: 5% acetonitrile, 95% 100 mM TEA. Buffer B: 65% 
acetonitrile, 35% TEA. 


Church buffer (for pre-hybridization and hybridization) 
0.5 M sodium phosphate (pH 7.2), 7% SDS, 1 mM EDTA
(pH 8.0). 


denaturant stock solution for electrophoresis (80%) (6% 
acrylamide, 32% formamide, 5.6 M urea) 

For 500 ml: 170 g electrophoresis-grade urea, 75 ml
acrylamide (40% stock), 160 ml deionized formamide
(100% stock), 25 ml TAE buffer (20 x stock), and H₂O to
500 ml. Store in dark glass bottles at 4 °C. To deionize
formamide: add 2 g of mixed bed resin (J.T. Baker) to
100 ml formamide and stir for 30 min. Filter to remove resin
and store in dark glass bottles.
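
The quantities in this recipe can be checked against the concentrations in its title (6% acrylamide, 32% formamide, 5.6 M urea). The Python sketch below does so; the helper names are illustrative, and the molar mass of urea (60.06 g/mol) is a standard value that does not appear in the text.

```python
UREA_MW = 60.06  # g/mol, standard molar mass of urea (not given in the recipe)


def strength_from_stock(stock_strength, stock_ml, final_ml):
    """Final strength (percent or fold) after diluting a stock to final_ml."""
    return stock_strength * stock_ml / final_ml


def molarity(mass_g, molar_mass, final_ml):
    """Molar concentration of mass_g of solute made up to final_ml."""
    return mass_g / molar_mass / (final_ml / 1000.0)


final_ml = 500
acrylamide_pct = strength_from_stock(40, 75, final_ml)    # 40% stock, 75 ml
formamide_pct = strength_from_stock(100, 160, final_ml)   # 100% stock, 160 ml
tae_fold = strength_from_stock(20, 25, final_ml)          # 20 x stock, 25 ml
urea_molar = molarity(170, UREA_MW, final_ml)             # 170 g urea

print(acrylamide_pct, formamide_pct, tae_fold, round(urea_molar, 2))
```

The urea value comes out slightly above 5.6 M, which agrees with the title to the precision given there.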


denaturing solution for chromosomes 
35 ml formamide, 5 ml 20 x SSC (pH 5.3), 10 ml sterile
distilled H₂O.


Denhardt's solution 
10 x: 0.2% bovine serum albumin/0.2% Ficoll 400/0.2%
polyvinylpyrrolidone (MW ~44 000). This can be
prepared as a 100 x stock.


DNA

Herring or salmon sperm DNA: Dissolve DNA at 10 mg ml⁻¹
in sterile H₂O and sonicate to a fragment length of
~500 bp. Store frozen in aliquots of 100-200 µl.


ethidium bromide (10 mg ml⁻¹)
Dissolve 1 g ethidium bromide in 100 ml H₂O.


fixative for chromosomes 
Methanol/glacial acetic acid, 3 : 1.


formamide/SSC (50% formamide, 2 x SSC (pH 7.0) and 
70% formamide, 2 x SSC (pH 7.0)) 

For 500 ml: 250 ml (= 50%) or 350 ml (= 70%) formamide,
50 ml 20 x SSC. Adjust pH to 7.0 with HCl. Unlike
formamide used in hybridization mixture, formamide for 
these two solutions does not need to be deionized. Both 
solutions can be stored at room temperature, but check 
that the pH = 7.0 prior to use. 


GET buffer (for alkaline lysis DNA miniprep) 
50 mM glucose, 25 mM Tris-HCl (pH 8.0), 10 mM EDTA.


HAT medium and supplements 
Solution 1: methotrexate (alternatives are amethopterin or
aminopterin). Add 0.045 g methotrexate to 10 ml distilled
H₂O. Add 1 M NaOH until the methotrexate dissolves.
Add 10 ml of distilled H₂O. Adjust the pH to between 7.5
and 7.8 with 1 M HCl. Make up to 100 ml. Filter-sterilize
and store at -20 °C.
Solution 2: hypoxanthine and thymidine (HT). Add 0.14 g
hypoxanthine to 30 ml distilled H₂O. Add 1 M NaOH until
the hypoxanthine dissolves. Adjust the pH to 10 with 1 M
HCl. Add 0.039 g thymidine to 35 ml distilled H₂O.
Combine the hypoxanthine and thymidine solutions and
adjust to 100 ml. Filter-sterilize and store at -20 °C. Add 1
ml of Solution 1 and 1 ml of Solution 2 to 98 ml of growth
medium.
Supplements for HAT medium.
BUdR: 100 x = 0.3 g 5-bromo-2′-deoxyuridine per 100 ml
H₂O (approx. 1 x 10⁻² M, so that 1 x = 1 x 10⁻⁴ M). Store
frozen. Light sensitive.
6-Thioguanine (2-amino-6-mercaptopurine). 50 x = 25 mg
in 150 ml H₂O (so that 1 x = 2 x 10⁻⁵ M). Add 1 N NaOH to
dissolve and adjust pH to 9.5 with 1 N acetic acid. Filter-
sterilize and store at -20 °C.
8-Azaguanine: 100 x stock = 76 mg in 50 ml (so that 1 x =
1 x 10⁻⁴ M). Add 1 N NaOH to dissolve; heat to 37 °C if
necessary and adjust pH to 9 with 1 N acetic acid.


Hirt squirt (for cell lysis) 
0.8% SDS/10 mM EDTA.


hybridization buffer (for microFISH) 
50% formamide, 10% dextran sulphate, 2 x SSC, 1% Triton
X-100, sterile distilled H₂O.


hybridization mix (for FISH) (2 x SSC, 50% formamide, 
10% dextran sulphate, 1% Tween 20) 

For 10 ml: dissolve 1 g dextran sulphate in 1 ml 20 x SSC,
1 ml 10% Tween 20, and double-distilled H₂O to a total
volume of 5 ml. Add 5 ml deionized formamide. Check
that pH = 7.0 and store at -20 °C in aliquots of 12 µl for
single use.


ionic extraction buffer (for detergent extraction of DNA 
from M13 phage) 

Tris-HCl (pH 8.0), 1 mM EDTA, 125 mM potassium iodide,
0.16 mM potassium lauryl sulphate.


kinasing buffer 
0.5 M Tris (pH 7.5), 10 mM ATP, 20 mM DTT, 10 mM
spermidine, 1 mg ml⁻¹ BSA, 100 mM MgCl₂.


Leishman’s stain 
3 g of Leishman’s powder dissolved in 1 litre methanol. 


ligation buffer (single-stranded, for use with T4 RNA 
ligase, for adding oligonucleotide to 5′ end of cDNA)
100 mM Tris-HCl (pH 8.0), 20 mM MgCl₂, 20 µg ml⁻¹ BSA,
50% PEG, 2 mM hexamine cobalt chloride, 40 µM ATP.


ligation buffer 1x (for YAC ligation, for use with T4 DNA 
ligase) 

50 mM Tris-HCl (pH 7.6), 30 mM NaCl, 10 mM MgCl₂, 1 x
polyamines (0.75 mM spermidine, 0.30 mM spermine) (the
polyamines may be omitted).


ligation buffer 10x 
400 mM Tris-HCl (pH 7.6), 100 mM MgCl₂, 1 mM DTT, 5 mM
ATP.


ligation buffer 5x (for use with DNA ligase, for ligating 
cosmid DNA into plasmids) 

250 mM Tris-HCl (pH 7.6), 50 mM MgCl₂, 5 mM ATP, 5 mM
DTT, 25% PEG 8000.


ligation buffer/mix 10x 

Low salt buffer: 60 mM Tris (pH 7.5), 60 mM MgCl₂, 50 mM
NaCl, 2.5 mg ml⁻¹ BSA, 70 µM β-mercaptoethanol, with
ligation additions: 1 mM ATP, 20 mM DTT, 10 mM
spermidine, 1 mg ml⁻¹ BSA, 100 mM MgCl₂.


linear polyacrylamide (LPA) carrier 

A reliable and completely noninjurious inert carrier
allowing efficient precipitation of picogram quantities of
DNA. Prepare by polymerization of a 5% acrylamide
solution with ammonium persulphate (0.1%) and TEMED
(0.1%). This solution is 50 mg ml⁻¹ and a working solution
at 2 mg ml⁻¹ is diluted from this. This is stored at AVE
and may be frozen and thawed many times. Usually,
5-10 µg per precipitation reaction is sufficient.


loading buffer for RNA fractionation (poly(A) mRNA 
isolation on oligo-dT columns) 

0.5 M lithium chloride, 50 mM Tris (pH 8.0), 5 mM EDTA,
1% SDS.


loading buffer for SSCP analysis 
95% formamide, 20 mM EDTA (pH 8.0), 0.05%
bromophenol blue, 0.05% xylene cyanol. 


loading solution for DGGE 5x 
0.25% (w/v) bromophenol blue, 0.25% (w/v) xylene
cyanol, 20% (w/v) Ficoll. Dissolve 20 g Ficoll, 250 mg
bromophenol blue and 250 mg xylene cyanol in a final
volume of 100 ml H₂O. Instead of Ficoll, glycerol can be
used.


lysis buffer (for mammalian cells) 
guanidinium thiocyanate (GuSCN) in 25% lithium 
chloride (LiCl). 


MacIlvaine’s buffer, pH 5.6

0.1 M anhydrous citric acid (solution A), 0.4 M anhydrous
sodium phosphate dibasic (solution B). For buffer, use 92
ml solution A and 50 ml solution B.


middle wash buffer (MWB) (for poly(A) mRNA isolation 
on oligo-dT columns) 
100 mM LiCl, 50 mM Tris (pH 8.0), 5 mM EDTA, 1% SDS.


mountant for chromosomes 
Citifluor/PI: 1 ml Citifluor mountant, 8 µl propidium
iodide (50 µg ml⁻¹).


nick translation buffer 10x 
500 mM Tris-HCl (pH 7.5), 50 mM MgCl₂, 500 µg ml⁻¹
nuclease-free BSA.


nucleotide mix for nick translation with biotin 10x 
500 µM dCTP, 500 µM dGTP, 500 µM dATP, 380 µM dTTP,
30 µM biotin-16-dUTP.


osmotic shock medium 
PBS/10% dimethyl sulphoxide (DMSO) 


panning buffer (for mammalian cells) 
PBS, 2 mM EDTA, 0.1% sodium azide, 5% FCS.


PBS 1x 

Solution A (1 litre): 8.20 g NaCl, 1.78 g Na₂HPO₄·2H₂O.
Solution B (500 ml): 4.14 g NaCl, 0.69 g NaH₂PO₄·H₂O.
Adjust the pH of solution A with solution B to 7.0.
Sterilize by autoclaving. Store at room temperature. 


PCR buffer 1x 
10 mM Tris-HCl (pH 8.9), 50 mM KCl, 1.5 mM MgCl₂, 0.1%
(w/v) Triton X-100, 0.01% (w/v) gelatin.


PCR buffer 2x 
20 mM Tris-HCl (pH 8.4), 100 mM KCl, 10 mM MgCl₂,
0.2 mg ml⁻¹ gelatin.


PCR buffer 10x 
100 mM Tris-HCl (pH 9.0) at 25 °C, 500 mM KCl, 15 mM
MgCl₂, 1.0% Triton X-100.


PCR mix 1x 
10 mM Tris-HCl (pH 8.3), 50 mM KCl, 1.5 mM MgCl₂,
dNTPs (250 µM each).


PEG solution for transformation of yeast spheroplasts 
20% polyethylene glycol 6000 MW, 10 mM Tris-HCl (pH
7.6), 10 mM CaCl₂.


PEG solution for DNA precipitation 
26.2% PEG 8000, 6.6 mM MgCl₂, 0.6 M NaOAc (pH 5.2).


primer, Alu-BK33 
5’CTGGGATTACAGGCGTGAGC3’. 


primer, 6-MW 
5’CCGACTCGAGNNNNNNATGTGG3’. 


reverse transcription buffer 5x 
250 mM Tris-HCl (pH 8.3), 400 mM KCl, 15 mM MgCl₂,
50 mM DTT.


RNase 

Prepare a stock solution of 10 mg ml⁻¹ RNase in sterile
H₂O. Inactivate any contaminating DNase by boiling for
10 min. Cool to room temperature and store frozen. 


SCE 1x 
1 M sorbitol, 0.1 M sodium citrate (pH 5.8), 10 mM EDTA
(pH 7.5).


Sorensen buffer

Solution A (100 ml): 0.946 g Na₂HPO₄ (anhyd.) (= 0.06 M).
Solution B (100 ml): 0.908 g KH₂PO₄ (= 0.06 M).

Adjust the pH of Solution A with Solution B to 6.8 (about
1 : 1). For 50% Sorensen buffer, mix 1 : 1 with double-
distilled H₂O.


SOS solution 
1 M sorbitol, 25% YPD, 6.5 mM CaCl₂, 10 1g ml? Tug ml
uracil.


SSC 1x 
150 mM NaCl, 15 mM sodium citrate, pH 7.0. 


SSC 20x 

3 M NaCl, 0.3 M sodium citrate. For 1 litre: 175 g NaCl, 88 g
sodium citrate. Adjust pH to 7.0 with HCl. Sterilize by 
autoclaving. 
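
The weights quoted for 20 x SSC follow directly from the target molarities. The Python sketch below reproduces them; it assumes that "sodium citrate" means trisodium citrate dihydrate (294.10 g/mol), which the recipe does not state, and the function name `grams_needed` is illustrative.

```python
# Molar masses in g/mol. The citrate entry assumes the dihydrate salt,
# which is an assumption: the recipe only says "sodium citrate".
MOLAR_MASS = {
    "NaCl": 58.44,
    "trisodium citrate dihydrate": 294.10,
}


def grams_needed(compound, molar, litres):
    """Mass in grams of `compound` for `litres` of solution at `molar` mol/l."""
    return MOLAR_MASS[compound] * molar * litres


nacl_g = grams_needed("NaCl", 3.0, 1.0)
citrate_g = grams_needed("trisodium citrate dihydrate", 0.3, 1.0)
print(f"NaCl: {nacl_g:.0f} g, sodium citrate: {citrate_g:.0f} g per litre")
```

Adding further entries to `MOLAR_MASS` allows the same check on the other recipes in this appendix that give both a molarity and a weight.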


SSCT 

4x SSC, 0.05% Tween 20 (or Triton X-100) (pH 7.0). For 1 
litre: 200 ml 20 x SSC, 5 ml 10% Tween 20 (or Triton X-100). 
Check pH = 7.0 and store at room temperature. 


SSCT-BSA 
3% (w/v) BSA in 4 x SSC, 0.05% (v/v) Triton X-100.


SSCTM 

Prepare fresh. Dissolve 0.5 g 99% fat-free dried milk 
(Marvel) in 10 ml SSCT. Centrifuge at 1500 r.p.m. for 5 min 
in order to pellet undissolved particles. Suck off and
discard the cloudy top layer and use the clear solution in 
the middle of the tube. 


STC 
1 M sorbitol, 10 mM Tris-HCl (pH 7.6), 10 mM CaCl₂.


TAE 50x 
For 1 litre: 242 g Tris base, 100 ml 0.5 M EDTA, pH 8.0. 
Adjust pH to 7.2 with glacial acetic acid (about 57 ml). 


TBE 0.5x 6% gel solution

460 g urea, 150 ml 40% acrylamide solution, 50 ml 10x TBE
buffer. Make up to 1 litre with H₂O and dissolve. Filter
through sintered glass funnel and store in the dark at 4°C
for up to several weeks.


TBE 5x 6% gel solution 

115 g urea, 37.5 ml 40% acrylamide solution, 125 ml 10x 
TBE buffer, 10 mg bromophenol blue (optional). Make up 
to 250 ml with H₂O and dissolve. Filter through sintered
glass funnel and store in the dark at 4°C for up to several 
weeks. 


TBE buffer 1x 
89 mM Tris, 89 mM boric acid, 2.5 mM EDTA (pH 8.3).


TBE buffer 5x 
For 1 litre: 54 g Tris base, 27.5 g boric acid, 20 ml 0.5 M
EDTA (pH 8.0), make up to 1 litre with H₂O.


TBE buffer modified for DNA sequencing gels (does not 
precipitate) 1x 
133 mM Tris, 44 mM boric acid, 2.5 mM EDTA (pH 8.8).


TBE buffer (pH 8.8) 10x 
For 1 litre: 162 g Tris base, 27.5 g boric acid, 9.2 g
Na₂EDTA, make up to 1 litre with H₂O.


TE 

10 mM Tris, 1 mM EDTA (pH 8.0). For 1 litre: 1.21 g Tris
base, 0.37 g EDTA. Adjust pH to 8.0 with HCl. Sterilize by
autoclaving.


TE50 
10 mM Tris-HCl (pH 7.6), 50 mM EDTA.


TEMED 
N,N,N′,N′-tetramethylethylenediamine.


TEN9
50 mM Tris-HCl (pH 9.0), 100 mM EDTA (pH 8.0-9.0),
200 mM NaCl.


TENP buffer 
10 mM Tris-HCl (pH 7.6), 20 mM EDTA, 30 mM NaCl +
polyamines (0.75 mM spermidine, 0.30 mM spermine).


transformation buffer I (TFBI) (for bacterial 
transformation) 

30 mM potassium acetate, 50 mM MnCl₂, 100 mM KCl,
10 mM CaCl₂, 15% glycerol (v/v).


transformation buffer II (TFBII) (for bacterial 
transformation) 


10 mM Na-MOPS (pH 7.0), 75 mM CaCl₂, 10 mM KCl, 15%
glycerol.


trisodium citrate (TSC) 

For a 3.3% solution, make up 33 g TSC with
double-distilled H₂O to 1 litre. Dispense in 40-ml
aliquots, autoclave and store at room temperature. Check
volume and adjust with sterile double-distilled H₂O
before use.


Triton-TE extraction buffer 
0.5% Triton X-100, 10 mM Tris-HCl (pH 8.0), 1 mM EDTA
(pH 8.0). 


tRNA 
E. coli tRNA: Dissolve E. coli tRNA at 10 mg ml⁻¹ in sterile
H₂O and store frozen in aliquots of 100-200 µl.


Vectashield (Vector Laboratories) 

A self-prepared mixture containing 22 mg 1,4-
diazabicyclo(2.2.2)octane (DABCO) in 1 ml 20 mM
NaHCO₃ (pH 8.0), 75% glycerol, or 10 mg ml⁻¹ p-
phenylenediamine in PBS mixed 1 : 9 with glycerol and
adjusted to pH 8.0 with 0.5 M carbonate-bicarbonate
buffer (pH 9.0).


Wright's stain stock solution 

Dissolve 1.25 g Wright’s stain in 500 ml methanol for 
around 1h. Filter the solution through filter paper 
(Whatman no. 1) and store this Wright's stock solution at 
room temperature protected from light in a brown glass
bottle. Older stock solutions usually give better results 
than fresh ones. Therefore, prepare the solution at least 2 
weeks before use. 


Media 


chorionic villus sample transport medium 

100 ml basal medium, e.g. Ham’s F1, 10 ml FCS, 1 ml
L-glutamine (200 mM), 3 ml penicillin or streptomycin
(10 000 IU ml⁻¹ or 10 000 µg ml⁻¹), 3 ml kanamycin
(10 000 µg ml⁻¹), 0.3 ml mycostatin (1000 IU ml⁻¹), and 1 ml
heparin (1000 IU ml⁻¹).


complete medium for culturing lymphocytes 

100 ml Ham’s F10 or RPMI 1640 medium, 10 ml FCS,
1.0 ml phytohaemagglutinin (purified), 1.0 ml penicillin
(5000 IU ml⁻¹), 1.0 ml streptomycin (5000 µg ml⁻¹), 1.0 ml
L-glutamine (200 mM).


double-selection growth medium 

4% dextrose, 0.67% yeast nitrogen base (without amino 
acids), base and amino acid supplements (-uracil, 
-tryptophan). 


freezing medium (for lymphocyte storage) 
RPMI 1640 medium/FCS/DMSO, 2:2: 1. 


Hogness modified freezing medium (HMFM) 10x

63 g l⁻¹ K₂HPO₄, 18 g l⁻¹ KH₂PO₄, 4.5 g l⁻¹ sodium citrate,
9 g l⁻¹ ammonium sulphate, 440 g l⁻¹ glycerol, and 0.9 g l⁻¹
MgSO₄·7H₂O.


LB broth/agar 
Standard Luria broth/agar.


media for culturing solid tumours 
see Chapter 8, Table 8.2. 


RPMI/Hepes 


RPMI 1640 medium containing 20 or 25 mM Hepes with
(or without) L-glutamine. 


SD broth

For 100 ml: 0.7 g yeast nitrogen base without amino acids,
2 g glucose, 5.5 mg adenine, 5.5 mg tyrosine. Adjust to pH
7.0 and autoclave (121 °C, 20 min). Add filter-sterilized 
solutions of: 7 ml casamino acids for double selection 
(-ura, -trp), and 2 ml 1% tryptophan for single selection 
(-ura). 


thawing medium (for lymphocytes) 
RPMI 1640/FCS, 9:1. 


TYM agar 
2% Bacto-Tryptone, 0.5% yeast extract, 0.1 M NaCl, 10 mM
MgSO₄.


YAC regeneration medium (single-selection medium) 
1 M sorbitol, 4% dextrose, 0.67% yeast nitrogen base
(without amino acids), amino acid supplements (20x
amino acid and adenine mixture, comprising adenine, arginine,
isoleucine, histidine, lysine and methionine, all at 400 mg l⁻¹;
leucine 1200 mg l⁻¹; phenylalanine 1000 mg l⁻¹; valine
3000 mg l⁻¹; tyrosine 600 mg l⁻¹), 20 µg ml⁻¹ tryptophan, 2%
agar.


YPD medium 
1% yeast extract, 2% bactopeptone, 2% dextrose. Make up 
with 2% agar for plates. 


YT medium 2x 

Per litre: 16 g yeast extract, 10 g tryptone and 5g NaCl 
supplemented after autoclaving with the appropriate 
antibiotic. 


Appendix II Preparation of blood
bottles and processing of blood samples


Where sufficient sample is available, blood samples taken 
for genetic analysis are routinely placed into two separate 
blood bottles for transport to the laboratory. One part is 
stored in RPMI/Hepes to ensure cell survival. The 
separated lymphocytes (see II.4) from this sample can be
used for cytogenetic analysis (see Chapter 7, Protocol 13), 
may be stored frozen, and may also be transformed by 
Epstein-Barr virus to produce immortalized cell lines (see 
later). The other part of the sample is placed into EDTA. 
This is used to prepare a batch of DNA for immediate use. 
Always wear gloves when dealing with blood and treat 
all human tissue samples as potentially infectious. 


II.1 Preparation of blood bottles

II.2 Filling blood bottles with blood

II.3 Processing of blood tubes

II.4 Lymphocyte sterile separations

II.5 Freezing cells for storage

II.6 Transformation of blood cells with Epstein-Barr virus
(EBV) for long-term culture


II.1 Preparation of blood bottles


Tube 1 Fifty-millilitre colour-coded (e.g. red-capped) flat- 
bottomed tube containing RPMI/Hepes to ensure cell 
survival. This will contain the blood sample to be used in 
the subsequent lymphocyte separations. 

Prepare the following mixture: 

• 200 ml RPMI/Hepes (e.g. from Gibco-BRL, Sigma);

• 40 ml 3.3% trisodium citrate;

• 2 ml 5 x 10° M β-mercaptoethanol (β-ME).

Add 20 ml per red-capped tube, seal top with a strip of
Parafilm, add label, date and store for up to 2 months at
4°C.


Tube 2 Twenty-millilitre sterile universal container
containing EDTA. When filled with blood this will be 
frozen and thawed and together with the residues from 
the lymphocyte separation will be used to prepare a batch 
of DNA. 

Add 4 ml 0.25 M EDTA to each 20-ml sterile universal. Seal
with a strip of parafilm, add a label, date and store for up 
to 6 months at 4°C. 

Prepared blood bottles should be stored in the fridge. 


II.2 Filling blood bottles with blood


Once taken, blood samples should be kept at room 
temperature; they will be stable up to a maximum of one 
week after collection. 

Remove blood bottles from fridge several hours before 
using. 

Blood bottles should be filled in the following order of 
priority for each blood sample: 
1 25 ml blood into large tube (tube 1) with tissue culture
medium;
2 15 ml blood into universal (tube 2) with clear medium
(0.25 M EDTA).
Include a copy of pedigree if from a family, indicating the 
individual from which blood was taken. 


II.3 Processing of blood tubes
Upon arrival at the laboratory, filled blood tubes should,
if possible, be processed immediately. All samples should
be logged on a running list.
1 The red-capped tubes with growth medium and blood 
are used to prepare sterile separations of lymphocytes. 
They can be stored for up to 1 week (after collection of the 
blood sample) at room temperature. 
2 The universal containing EDTA plus 10-15 ml of blood 
is stored at -80 °C. Before freezing, stand the tube upright 
to allow blood to separate from the plasma/EDTA. Take 
2 x 500-µl aliquots of the plasma/EDTA and store at -80 °C.
After use, all glassware is decontaminated by overnight soaking 
in 2% chloros or by adding Weskodyne. All defibrinated clots 
plus glass beads, plastic disposables and paper tissue are double 
bagged, autoclaved and incinerated. 

Gloves should be worn throughout blood handling and 
centrifugation is carried out in sealed tubes with aerosol- 
preventing lids. 


II.4 Lymphocyte sterile separations


Materials 

• Blood sample (e.g. 50-ml Falcon tube containing 25 ml
blood and 25 ml citrated medium as prepared in II.1 and
II.2).

• RPMI/Hepes (e.g. Gibco-BRL). Store at 4°C until
needed. Prewarm to room temperature prior to use.

• 1 M calcium chloride (dihydrate). Prepare a 100-ml
stock solution. Dissolve 14.7 g CaCl₂ in double-distilled
water and make up to a final volume of 100 ml. Autoclave
or filter stock and dispense into 5-ml aliquots. Store
stocks at 4°C.

• glass beads 4-mm undrilled (500 g) (LIF). Put
approximately 20 beads into a glass universal and
autoclave.

• Lymphoprep (lymphocyte separation medium)
(Nycomed).

• acetic acid: Use glacial acetic acid as 100% stock and
dilute with double-distilled water, e.g. 50 ml stock at 4%
(2 ml acetic acid (glacial) in 48 ml double-distilled water).
Store at 4°C. Acetic acid will oxidize with time, therefore
the stock solution should be replaced each month.

• nigrosine (water soluble) (25 g) (BDH). Nigrosine is
made up as a 1% solution in PBSA. Filter after preparation
and store at room temperature.


Method 

1 Pour contents of tube into a 250-ml flask labelled
clearly with patient’s name. Rinse out blood bottle with
4 ml RPMI/Hepes (the RPMI/Hepes must be at room
temperature).

2 Add sterile beads (1 bead per ml blood).

3 Keep foil on the flask and add 0.6 ml sterile 1 M CaCl₂
through the foil using a 1-ml syringe and needle.

4 Immediately after adding the CaCl₂, defibrinate for 15
min at 260 r.p.m. on a gyratory shaker.

5 Add 20 ml RPMI/Hepes to flask.

6 Divide defibrinated blood between two 50-ml Falcon
(type 2070) tubes each containing 15 ml Lymphoprep
(diluted blood/Lymphoprep, 2 : 1), overlaying very
carefully with a 25-ml pipette attached to a pipette aid
(rinse out with RPMI/Hepes, add to tubes). The
Lymphoprep must be at room temperature.

7 Spin at 1800 r.p.m. (700 g) for 20 min using a centrifuge
with a swing-out rotor (e.g. Beckman TJ6), bringing the
speed up slowly.

8 Using a sterile Pasteur pipette, remove the interface to
a 50-ml Falcon (type 2070) tube. Dilute 1 : 1 with
RPMI/Hepes and count the cells (e.g. in a Neubauer
counting chamber) in 4% acetic acid. Nigrosine is used to
check cell viability.

9 Spin at 2300 r.p.m. (1000 g) for 10 min.

10 Aspirate off the RPMI/Hepes and freeze cells in two
freezing vials (labelled with patient’s name, date and cell
count), and three nonsterile straws. Vials should contain
no less than 3 x 10⁶ cells.

11 Keep residues for DNA extraction by aspirating off 
down to the Lymphoprep-RPMI/Hepes interface. 
Combine residues into one or two universals and freeze at 
PIE, 
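
Steps 7 and 9 quote each speed both in r.p.m. and as a relative centrifugal force, and the relation between the two depends on the rotor radius. The Python sketch below uses the standard conversion RCF = 1.118 x 10⁻⁵ x r (cm) x (r.p.m.)². The radius of 19.3 cm is a hypothetical value chosen only because it gives roughly 700 g at 1800 r.p.m.; the text does not state the rotor radius, so the figure for the centrifuge actually in use should be taken from its manual.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in r.p.m.


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (multiples of g) at a given speed and radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Speed in r.p.m. needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


radius_cm = 19.3  # hypothetical rotor radius, not taken from the protocol
print(round(rcf_from_rpm(1800, radius_cm)))   # force reached in step 7
print(round(rpm_from_rcf(1000, radius_cm)))   # speed for the 1000 g of step 9
```

When a different rotor is used, the g value is the one to match, not the r.p.m. setting.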


II.5 Freezing cells for storage


Cells for freezing must be viable, therefore cell lines (e.g.
suspension cells or attached lines) should be growing
rapidly and must be subconfluent (80-90% of maximum).
Primary cells for future transformation (e.g. mixed
lymphocyte populations) should be stored only after
careful counting. Cell stocks can be stored in a freeze mix
of fetal calf serum and dimethyl sulphoxide (FCS/DMSO)
at a ratio of 90 : 10.
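
Cell counts from a Neubauer chamber convert to a concentration as follows. The Python sketch below assumes the usual chamber geometry, in which each large square is 1 mm x 1 mm x 0.1 mm deep and so holds 10⁻⁴ ml; the counts shown are invented for illustration, and the function name `cells_per_ml` is not from the protocol.

```python
SQUARE_VOLUME_ML = 1e-4  # one large Neubauer square: 1 mm x 1 mm x 0.1 mm


def cells_per_ml(counts_per_square, dilution_factor=1.0):
    """Cell concentration of the original suspension.

    counts_per_square: cells counted in each large square
    dilution_factor:   total dilution before loading, e.g. 2 for a 1 : 1 mix
    """
    if not counts_per_square:
        raise ValueError("at least one square must be counted")
    mean_count = sum(counts_per_square) / len(counts_per_square)
    return mean_count * dilution_factor / SQUARE_VOLUME_ML


# Invented counts from four large squares after a 1 : 1 dilution
concentration = cells_per_ml([48, 52, 50, 50], dilution_factor=2)
print(f"{concentration:.2e} cells per ml")
```

Multiplying the result by the suspension volume gives the total cell number to divide between vials.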


Preparation of FCS/DMSO freeze mix 

1 Thaw a 500-ml bottle of FCS and aliquot extremely
carefully into 90-ml lots (use twice autoclaved blue- 
capped Duran bottles). 

2 Place serum into a waterbath preheated to 56°C for 30 
min. (This step eliminates complement and other 
components from the serum.) 

3 Cool sample to room temperature and add 10 ml 
DMSO to the serum. Mix thoroughly. Aliquot into 1-ml or 
5-ml lots. Store at -20 °C until needed. The mix can be 
thawed at 37 °C, though care should be taken in wiping 
off water and cleaning with industrial methylated spirits 
to avoid contamination. 


Adding cells to freeze mix 

1 After pelleting cells by centrifugation, aspirate all the
medium from the cell pellet.

2 Label vials carefully with date, cell line, name and
passage number if applicable.

3 Add 0.5-1 ml of freeze mix per vial of cells to be frozen.

4 It is essential that cells be quickly frozen once they are
in freeze mix. If there are a large number of lines to be
frozen, only prepare 4-6 vials at a time. Wrap vials in a
couple of layers of tissue and place at -80°C for at least
12 h.

5 Cells should be quickly transferred to liquid nitrogen.
Care should be taken in handling liquid nitrogen: wear gloves.
Take the liquid nitrogen tank to the freezer and add the
vials to the appropriate space.

6 Carefully note the space in which the vials are placed,
add the details of the cell line, date and space to a cell line
card or an appropriate computer database. Also, add
details of the computer databases. This must be done soon
after storing the cells. Similarly, if cells are removed from
storage the appropriate steps should be taken to edit the
card and computer storage.

Note: DMSO comes in 500-ml bottles. Only a clean bottle
should be used for tissue culture. Handle with care: it is
taken up through the skin in seconds. It is used to apply
drugs topically to skin lesions. DMSO is toxic and can kill
cells if they are not frozen immediately.


II.6 Transformation of blood cells with
Epstein-Barr virus (EBV) for long-term 
culture 


Overview 
(a) Preparation of EBV pool for transformation. 
(b) Transformation of lymphocytes. 


(a) Preparation of EBV pool for 
transformations 


Materials 

• Marmoset cell line B95-8 cells (tested to ensure they are
free from mycoplasma infection)

• 10% FCS/RPMI 1640 medium


Method 

1 Grow cells to 1 x 10⁶ ml⁻¹ in 10% FCS/RPMI 1640
medium at 37 °C (e.g. 500-ml vols in large plastic TC
flasks).

2 Dilute to 0.2 x 10⁶ ml⁻¹.

3 Incubate at 33 °C for 2 weeks, mixing occasionally. 

4 Allow cells to settle at 4°C overnight. 

5 Spin supernatant to clarify. 

6 Filter supernatant to be sure all cells are removed. 

7 Aliquot supernatant in 2-1 vols. Store in liquid nitrogen 
tank. 

8 Test by comparing transformation ability with previous 
batch. (Use duplicate vial of cells known to have been 
transformed successfully before.) 


(b) Transformation of batches of frozen 
lymphocytes with varying numbers of cells 


Materials 
• frozen lymphocytes
• 20% FCS/RPMI 1640 medium
• 20% FCS/RPMI 1640 medium containing cyclosporin,
prepared as follows.

Dissolve 3 mg cyclosporin A powder in 200 µl
absolute ethyl alcohol.

Add 60 µl Tween 20.

Add 740 µl serum-free RPMI 1640 dropwise using
whirlmixer after every drop.

Add 2 ml RPMI 1640 medium with 10% FCS
dropwise with whirlmixing.

Aliquot and store at -20 °C. This medium is stable for
several months.

Use diluted 1 : 1000 to give 1 µg ml⁻¹ for use in
transformations.
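
The cyclosporin A figures can be checked by summing the component volumes. The Python sketch below takes the three small additions as 200, 60 and 740 µl plus the 2 ml of medium, and assumes the volumes are additive; the variable names are illustrative.

```python
# Component volumes of the cyclosporin A stock, in microlitres
volumes_ul = {
    "absolute ethanol": 200,
    "Tween 20": 60,
    "serum-free RPMI 1640": 740,
    "RPMI 1640 with 10% FCS": 2000,
}
cyclosporin_mg = 3.0

total_ml = sum(volumes_ul.values()) / 1000.0       # total stock volume in ml
stock_mg_per_ml = cyclosporin_mg / total_ml        # stock concentration
working_ug_per_ml = stock_mg_per_ml * 1000 / 1000  # mg to ug, then 1 : 1000

print(total_ml, stock_mg_per_ml, working_ug_per_ml)
```

The stock therefore comes to 1 mg ml⁻¹ in 3 ml, and the 1 : 1000 dilution gives the 1 µg ml⁻¹ working concentration quoted in the method.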


Method 

1 Thaw lymphocytes quickly at 37 °C. 

2 Wash with 10 ml 20% FCS/RPMI 1640 medium in a
sterile conical container removing an aliquot for
counting/viability testing before centrifuging at no more
than 1000 r.p.m. for 5 min. Remove supernatant. 

3 Add 0.2 ml EBV stock and incubate at 37 °C for 1 h.

4 Add 10 ml 20% FCS/RPMI 1640 medium containing
1 µg ml⁻¹ cyclosporin A.

5 Distribute 2 ml to each of two tissue culture tubes 
containing 1 x 10° human fibroblasts as a feeder layer 
previously treated with mitomycin C or irradiation. 

6 After 5 days’ incubation at 37 °C in an atmosphere of 5%
CO₂ in air, add 1 ml 20% FCS/RPMI 1640 containing
1 µg ml⁻¹ cyclosporin A to each tube.

7 Twice weekly thereafter remove 2 ml medium and add
2 ml fresh medium: 20% FCS/RPMI 1640 containing 1 µg
ml⁻¹ cyclosporin A. After 2 weeks’ culture, the cyclosporin
A should be omitted. To guard against loss of cultures
from contamination (as the culturing will proceed, on
occasion, for up to 3 months and normally for 2 months),
use two different bottles of medium so that the two sets of
tubes are fed from separate sources and the same pipettes
etc. never touch the two tubes.

8 At 4 weeks, the tubes will normally transfer
successfully to 25-cm² tissue culture flasks, starting with
the flask in the upright position and only 4 ml medium.
Feeding with small volumes of medium regularly has 
been found to be more satisfactory than infrequent large 
amounts until the cultures are obviously well established.


Appendix III Commercial suppliers


Contact addresses are listed 
alphabetically at the end of the appendix. 
This list covers materials mentioned in 
protocols and is not intended to be 
comprehensive. Inclusion here does not 
imply any endorsement by the Imperial 
Cancer Research Fund. Suppliers’ lists 
available on the World Wide Web include 
Anderson’s Timesaving Comparative
Guides (http://www.atcg.com) and
Biosupplynet (http://www.biosupplynet.com).


Chromatography, filtration and 
separation media (e.g. treated columns, 
magnetic beads, DNA-binding resin, 
DNA purification kits) 

Amicon 

BIO 101 

Bio-Rad 

Collaborative Research 

Dynal 

Nycomed 

Pharmacia 

Promega 

Qiagen 


Cytogenetics, and fluorochromes, 
labelled nucleotides, labelled 
antibodies, chromosome paints and 
other materials for FISH 

See Chapter 10, Table 10.1 for details of
chromosome paints currently available
commercially

Alpha Laboratories (UK)
Appligene Oncor 

Boehringer Mannheim 
Cambio 

Citifluor Ltd. (UK) 

Cytocell Ltd 

Life Technologies (Gibco-BRL) 
Sigma 

Vector Laboratories 

Vysis 


Electrophoresis 
Bio-Rad 

FMC Bioproducts 
Pharmacia (Hoefer) 


Flow cytometry 
Becton Dickinson 
R&D systems 


General laboratory products: chemicals, 
consumables, tissue culture products, 
media, etc. 

BDH Laboratory Supplies (UK) 
Becton Dickinson 

Bibby Sterilin 

Boehringer Mannheim 

Eppendorf 

Falcon 

J.T. Baker 

Life Technologies (Gibco-BRL) 
Millipore 

Pierce 




Seromed 

Sigma 

Wellcome Diagnostics 
Whatman 


Microscopy and imaging (see also 
cytogenetics) 

Alpha Laboratories 

Carl Zeiss Jena GmbH 

Chroma Technology Corp 
Digital Scientific Instruments 
Leica 

Molecular Dynamics 

Nikon 

Olympus 

Perceptive Scientific Instruments 


Restriction enzymes, polymerases,
ligases, plasmids, primers, clone
libraries, kits etc. for cloning and PCR


Amersham 

ATCC 

BIO 101 

Bio-Rad 
Calbiochem-Novabiochem
Clontech 

DuPont-Merck 

Epicentre Technologies 
Life Technologies (Gibco-BRL) 
Invitrogen 

New England Biolabs 
Novagen 

Perkin-Elmer 

Research Genetics
Sigma
Stratagene 
TaKaRa 


Addresses 


Advanced Biotechnologies 

Unit 7, Mole Business Park, Randalls 
Road, Leatherhead, Surrey KT22 7BA, 
UK 

Tel.: (+44 1372) 360 123 

Fax: (+44 1372) 363 263 


Alpha Laboratories Ltd 

40 Parkham Drive, Eastleigh, Hants 
SO5 4NU, UK 

Tel.: (+44 1703) 610 911 

Fax: (+44 1703) 643 701 


ATCC (American Type Culture 
Collection) 

WWW: http://www.atcc.org/

E-mail: tech@atcc.org 

12301 Parklawn Drive, Rockville, MD 
20852-1776, USA 

Tel.: (+1 301) 231 5585 and 881 2600 
Fax: (+1 301) 231 5826 and 770 1848


Amersham International plc
WWW: http://www.amersham.co.uk/life/
UK 

Tel.: (+44 1494) 544 000 

Fax: (+44 1494) 524 266 

USA 

Tel.: (+1) 800 323 9750 

Fax: (+1) 800 228 8735 

Europe Tel.: (+44 1494) 544 000 
Japan Tel.: (+81 3) 38 16 1091 


Amicon 

Amicon Inc., 72 Cherry Hill Drive, 
Beverly, MA 01915, USA 

Tel.: (+1) 800 343 1397 

Europe (+49 2302) 960 600 


Appligene Oncor 

Pinetree Centre, Durham Road, Birtley, 
Chester-le-Street, Co Durham, DH3 ZuD; 
UK 

Tel: (+44 191) 429 0022 


BDH Laboratory Supplies (Merck) (UK) 
Tel.: 0800 223 344 
Fax: (+44 1455) 558 586 


Beckman 

USA Tel.: (+1) 800 742 2345 

UK Tel.: (+44 1494) 441 181 
Fax: (+44 1494) 447558 

Germany Tel.: (+49 89) 38 871 

France Tel.: (+33 1) 43.01 7000 

Australia Tel.: (+61 02) 816 5288 

Japan Tel.: (+81 3) 3221 5831 


Becton Dickinson 

UK 

Tel.: (+44 1865) 748844 

Fax: (+44 1865) 781523 

USA 

Tel.: (+1) 800 223 8226/952 3222 
Fax: (+1 498) 954 2009 


Bibby Sterilin Ltd (UK) 
Tilling Drive, Stone, Staffs ST15 0SA, UK 
Tel.: (+44 1785) 812121 
Fax: (+44 1785) 813748 


BIO 101 

WWW: http://www.bio101.com
USA 

Tel.: (+1 619) 598 7299/800 424 6101 
Fax: (+1 619) 598 0116 

UK Tel.: (+44 1582) 456 666 


Bio-Rad Laboratories 
USA Tel.: (+1) 800 4BIORAD/(510) 741 100 
UK Tel.: (0800) 181 134 
Fax: (+44 1442) 259 118 
France Tel.: (+33 1) 49 60 68 34 
Germany Tel.: (+49 89) 318 840 
Japan Tel.: (+81 3) 5811 6270 
Australia Tel.: (+61 2) 8055000 


Boehringer Mannheim 

WWW: http://biochem.boehringer.com
Boehringer Mannheim GmbH, D-68298 
Mannheim, Germany 

Tel.: (+49 621) 759 8545/0621 759 8568 
UK Tel.: 0800 521 578 

USA Tel.: (+1) 800 428 5433 

France Tel.: (+33) 76 76 30 86 

Australia Tel.: (+61 2) 899 7999

Japan Tel.: (+81 3) 3432 3155 


Calbiochem-Novabiochem International 
WWW: http://www.calbiochem.com
USA 

Tel.: (+1) 800 854 3417/800 662 2616
Fax: (+1) 800 776 0999/617 577 8015
Germany: 

Tel.: (+49 6196) 63955 

Fax: (+49 6196) 62361 

UK 

Tel.: (+44 115) 943 0840 

Fax: (+44 115) 943 0951 

Japan 

Tel.: (+81 3) 5443 0281 

Fax: (+81 3) 5443 0271 

Australia 

Tel.: (+61 612) 318 0322 

Fax: (+61 612) 319 2440 


Cambio 

E-mail: postmaster@cambio.demon.co.uk 
34 Millington Road, Cambridge 

CB3 9HP, UK 

Tel.: (+44 1223) 366 500 

Fax: (+44 1223) 350 069 


Cambridge Bioscience 

24-25 Sigent Court, Newmarket Road, 
Cambridge CB5 8LA, UK 

Tel.: (+44 1223) 316 855 

Fax: (+44 1223) 60732 


Carl Zeiss Jena GmbH 

WWW: http://www.zeiss.com 
E-mail: mikro@zeiss.de 

Tel.: (+49 36 41) 64 29 36 

Fax: (+49 36 41) 64 31 44

USA

Tel.: (+1 914) 747 1800/800 233 2343
Fax: (+1 914) 681 7446


Chroma Technology Corp 

72 Cotton Mill Hill, Unit A-9, 
Brattleboro, VT 05301, USA 
Tel.: (+1 802) 257 1800 

Fax: (+1 802) 257 9400 


Clontech 

WWW: http://www.clontech.com 
E-mail: tech@CLONTECH.com 

USA 

Tel.: (+1) 800 662-CLON/(415) 424 8222 
Fax: (+1) 800 424 1350/(415) 424 1064 
UK Distributed by Cambridge 
Bioscience 

Germany Tel. (+49 6221) 303 907 


Collaborative Research 

Biomedical Products Division, 
Collaborative Research Inc., 2 Oak Park, 
Bedford, MA 01730, USA 

Tel.: (+1 617) 275 0004 

Fax: (+1 617) 275 0043 


Cytocell Ltd 

Somerville Court, Banbury Business
Park, Adderbury, Oxfordshire
OX17 3SN, UK

Tel.: (+44 1295) 810 910


Difco Laboratories Ltd 

PO Box 14B, Central Ave, East Molesey,
Surrey KT8 0SE, UK

Tel.: (+44 181) 979 9951 


Digital Equipment Corporation 


WWW: http://www.digital.com/info.html
E-mail: info@digital.com 


Digital Scientific Instruments 

36 Cambridge Place, Hills Road, 
Cambridge, CB2 1NS, UK 

Tel.: 

Fax: 


Dupont (UK) Ltd 

NEN Research Products, Wedgewood
Way, Stevenage, Herts SG1 4QN, UK
Tel.: (+44 1438) 734026/28/31

Fax: (+44 1438) 734379


Dynal

Dynal AS, Norway 
Tel.: (+47) 22 06 10 00 
Fax: (+47) 22 507015 
UK 

Tel.: (+44 151) 346 1234 
Fax: (+44 151) 346 1223 
USA 

Tel.: (+1) 800 638 9416 
Fax: (+1 516) 326 3298
Australia 

Tel.: (+61 1) 800 623 435 
Fax: (+61 3) 663 6660 
Japan 

Tel.: (+81 3) 3435 1558 
Fax: (+81 3) 3435 1526 


Epicentre Technologies 

E-mail: techhelp@epicentre.com 

1202 Ann St, Madison, WI 53713, USA 
Tel.: (+1 608) 258 3080 

Fax: (+1 608) 258 3088 


Eppendorf 

WWW: http://www.eppendorf.com/eppendorf

E-mail: eppendorf@eppendorf.com 
Germany 

Eppendorf-Netheler-Hinz GmbH 
Tel.: (+49 40) 538010 

Fax: (+49 40) 5 38 01 556 

USA Tel.: (+1) 800 645 3050 


Falcon 

Marathon Laboratory Supplies, Unit 6, 
55-57 Park Royal Road, London 
NW107]J, UK 

Tel.: (+44 181) 965 6865/6886 

Fax: (+44 181) 965 0989 


Fluka 

Fluka Chemie AG, Industriestrasse 25, 
CH-9470 Buchs, Switzerland 

Tel.: (+41 81) 755 25 11 

UK Tel.: (+44 1747) 822 211 

USA Tel.: (+1 516) 467 0980 


FMC BioProducts 

WWW: http://www.bioproducts.com 
FMC Corporation, Rockland, ME, USA 
Tel.: (+1 207) 594 3400 

UK 

Distributed by Flowgen Instruments 
Tel.: (+44 1795) 429 737 

Germany Tel.: (+49 51) 522075 

France Tel.: (+33 1) 34 84 6252 
Australia Tel.: (+61 2) 520 2122 

Japan Tel.: (+81 775) 43 7235 


Genetics Computer Group (GCG) 
(Wisconsin Sequence Analysis Package) 
E-mail: info@gcg.com


Genetix 
16 Riverside Park, Wimborne, Dorset
BH21 1QU, UK
Tel.: (+44 1202) 881122
Fax: (+44 1202) 840577


Gibco-BRL see Life Technologies 


Hybaid 

WWW: http://www.hybaid.co.uk

UK 

Tel.: (+44 181) 614 1000 

Fax: (+44 181) 977 0170 

USA Tel.: (+1) 800 634 8886/516 244 2929 


Imagenetics see Vysis 


Invitrogen 

WWW: http://www.invitrogen.com
USA

Tel.: (+1) 800 955 6288/619 597 6200
Fax: (+1 619) 597 6201

Europe

Tel.: (+31 594) 515 175

Fax: (+31 594) 515 312 

E-mail: tech_service@invitrogen.nest.nl 
Australia Tel.: (+61 3) 562 6888 

Japan Tel.: (+81 3) 5684 1616 


Jencons 

Cherrycourt Way Industrial Estate, 
Leighton Buzzard, Beds LU7 8UA, UK 
Tel.: (+44 1525) 372010 

Fax: (+44 1525) 372010 


J.T. Baker 

Mallinckrodt Baker UK 
Tel.: (+44 1908) 506 000 
Fax: (+44 1908) 503 290 
Germany 

Tel.: (+49 6152) 90 33 72 
Fax: (+49 6152) 90 33 99 
France 

Tel.: (+33 1) 48 44 65 44 
Fax: (+33 1) 48 44 65 18
USA 

Tel.: (+1 908) 859 2151 
Fax: (+1 908) 854 9318 


Leica 

WWW: http://www.bodan.net/leica
PO Box 2040, D-35530 Wetzlar, Germany 
Tel.: (+49 64) 41 290 

Fax: (+49 64) 41 29 33 99 

Switzerland 

Tel.: (+41 71) 727 37 43 

Fax: (+41 71) 727 46 67 

USA 

Tel.: (+1 708) 405 0123/800 248 0123 
Fax: (+1 708) 405 0030 


Li-Cor Biotechnology Division 

4421 Superior St, PO Box 4000, Lincoln, 
NB 68504, USA 

Tel.: (+1) 800 645 4267/402 467 0700
Fax: (+1 402) 467 0819 

UK Tel.: (+44 181) 614 1000 

Germany Tel. (+49 80) 92 82 890 
Netherlands (+31 2946) 3119 


Australia Tel.: (+61 2) 417 8877 
Japan Tel.: (+81 422) 455111 


Life Sciences International 
WWW: http://www.lifesciences-intl.co.uk


Life Technologies (Gibco-BRL) 
WWW: http://www.lifetech.com
WWW: http://www.lifetecheuro.co.uk
(Europe)

8400 Helgerman Ct, PO Box 6009, 
Gaithersburg, MD 20884, USA 

Tel.: (+1 301) 840 8000/800 828 6686 
Fax: (+1 800) 331 2286/716 774 6783 
UK 

Tel.: 0 800 838 380/0800 838 380 
Fax: (+44 141) 814 6260 

Japan 

Tel.: (+81 3) 3663 7974 

Fax: (+81 3) 3663 8242 


Molecular Dynamics 

WWW: http://www.mdyn.com 
USA Tel.: (+1) 800 333 5703 

UK Tel.: (+44 1494) 793377 
Australia Tel.: (+61 3) 9810 9572 
Japan Tel.: (+81 3) 3976 9692 


MWG-Biotech 

WWW: http://www.mwgdna.com/biotech

E-mail: oligo@mwgdna.com

Tel.: (+49 80) 92 2 10 84 

Fax: (+49 80) 92 82 89 77 


NBS Biologicals 

New Brunswick Scientific 
Tel.: (+44 1707) 275 733 
Fax: (+44 1707) 267 859 


New England Biolabs 

WWW: http://www.neb.com

USA

Tel.: (+1) 800 NEB LABS/508 927 5054
Fax: (+1 508) 921 1350

E-mail: info@neb.com 

UK 

Tel.: 800 31 84 86 (+44 1462) 420 616 
Fax: (+44 1462) 421 057 

E-mail: info@uk.neb.com 

Germany 

Tel.: (+49 130) 83 30 31/6196 3031 
Fax: (+49 6196) 83639 

E-mail: info@de.neb.com

Australia Tel.: (+61 75) 940299 
Japan Tel.: (+81 3) 3272 0671 


Nikon (Electronic Imaging Division) 
WWW: http://www.kit.co.jp/Nikon
E-mail: nikonbio@aol.com

USA

Tel.: (+1 516) 547 8500

Fax: (+1 516) 547 0306

UK

Instruments Division, Nikon House, 380
Richmond Road, Kingston, Surrey KT2
5PR, UK


Novagen 

WWW: http://www.novagen.com 
E-mail: novatech@novagen.com 

USA 

Tel.: (+1) 800 526 7319 

Fax: (+1 608) 238 1388 

UK Tel. (+44 1993) 706 500/(+44 1670) 
732 992 


Nycomed (UK) Ltd 
Nycomed House 2111 Coventry Road, 
Sheldon, Birmingham B26 3EA, UK 


Olympus 
Olympus Optical Co. (Europe) 
Fax: (+49 40) 23 77 36 47 


Perkin-Elmer 

PE Applied Biosystems 

WWW: www.perkin-elmer.com 
E-mail: pebio@perkin-elmer.com 
USA Tel.: (+1) 800 327 3002 

UK Tel.: (+44 1925) 825 650 
France Tel.: (+33 1) 4990 18 00 
Germany Tel. (+49 6150) 101 0 
Other Tel.: (+49) 6103 708 301 


Perceptive Scientific Instruments 
2525 South Shore Boulevard, League 
City, TX 77573, USA 

Tel.: (+1 713) 334 3207 

Fax: (+1 713) 334 3116 

International 

Tel.: (+44 244) 682 288 

Fax: (+44 244) 681555 


Pharmacia Biotech 

WWW: http://www.biotech.pharmacia.se

Tel.: (+46 18) 16 50 11

UK

Tel.: (+44 1727) 814 000

Fax: (+44 1727) 814 001
USA

Tel.: (+1) 800 526 3593

Fax: (+1 908) 857 0557
Japan Tel.: (+81 3) 3492 6949


Pierce 

WWW: http://www.piercenet.com
E-mail: PierceChem@mcimail.com
USA

PO Box 117, Rockford, IL 61105, USA
Tel.: (+1 815) 968 0747/800 874 3723
Fax: (+1 815) 968 8148/800 842 5007
UK Tel.: (+44 1244) 382525 

Germany Tel.: (+49 22) 419 68 50 
France Tel.: (+33) 7003 88 55 


Promega 

WWW: http://www.promega.com
USA

Tel.: (+1) 800 356 9526/(+1 608) 274 4330
Fax: (+1) 800 356 1970/(+1 608) 277 2516
UK 

Tel.: (+1) 800 378994 

Australia Tel.: (+1) 800 225 123 

France Tel.: (+33) 05 48 79 99 

Japan Tel.: (+81 3) 3669 7981

Netherlands Tel.: (+31 71) 5324 244 
Switzerland Tel.: (+41 1) 830 7037 


QIAGEN 

UK 

Tel.: (+44 1306) 740 444/760 444 
Fax: (+44 1306) 875885 
USA 

Tel.: (+1) 800 426 8157 
Fax: (+1) 800 718 2056 
Germany 

Tel.: (+49 2103) 8920 
Fax: (+49 2103) 892 222 
Switzerland 

Tel.: (+41 61) 317 9420 
Fax: (+41 61) 317 9422 


R&D Systems 

WWW: http://www.rndsystems.com 
USA 

Tel.: (+1) 800 343 7475/(+1 612) 379 2958 
Fax: (+1 612) 379 6580 

Europe 

Tel.: (+44 1235) 531 074 

Fax: (+44 1235) 533 420 

Australia Tel.: (+61 62) 008 25 1437 
Japan 

Tel.: (+81 3) 5684 1522 

Fax: (+81 3) 5684 1633 


Research Genetics 

WWW: http://www.resgen.com/
Research Genetics Inc., 2130 Memorial 
Parkway, SW, Huntsville, AL 35801, USA 
Tel.: (+1) 800 533 4363 

UK 

Tel.: 0 800 89 1393 

Fax: (+44 205) 536 9016 


Sigma-Aldrich 

WWW: http://www.sigma.sial.com
E-mail: sigma-techserv@sial.com

USA Tel.: (+1 314) 771 5750 (collect)/800
325 3010

UK 

Tel.: (+44 1202) 733 114 (Sigma Chemical 
Co) 


Tel.: (+44 1747) 822 211 (Aldrich 
Chemical Co) 

Germany Tel.: (+49 130) 5155 
France Tel.: (+33) 05 21 14.08 


Stratagene 

E-mail: tech_services@stratagene.com
UK 

Tel.: 0800 585 370/(+44 1223) 420 955 
Fax: (+44 1223) 420 234 

USA 

Tel.: (+1) 800 424 5444/(+1 619) 535 5400 
Fax: (+1 619) 535 0045 

Germany 

Tel.: (+49 6221) 400634 

Fax: (+49 6221) 400639 

Switzerland 

Tel.: (+41 1) 364 1106

Fax: (+41 1) 365 7707 

Australia Tel.: 1800 252 204 

Japan Tel.: (+81 3) 3660 4819/5684 1622


TaKaRa 

Takara Shuzo Co. Ltd, Otsu, Shiga, Japan 
Tel.: (+81 775) 43 7247 

Fax: (+81 775) 43 9254 

Europe 

Tel.: (+33 1) 41 470114 

Fax: (+33 1) 47 92 18 80 

UK 

Distributed by Severn Biotech Ltd 
Tel.: (+44 1562) 825 286 

Fax: (+44 1562) 825 284 


Vector Laboratories 
UK 

Tel.: (+44 1733) 265530 
Fax: (+44 1733) 263048 
USA 

Tel.: (+1 415) 697 3600
Fax: (+1 415) 697 0339 


Vysis (formerly Imagenetics) 
USA 

Vysis Inc. 

Tel.: (+1) 800 553 7042 

Europe (+49 711) 720 250 

UK (+44 181) 332 6932 


Wellcome Diagnostics 
Temple Hill, Dartford, Kent DA1 5AH,
UK 


Whatman International Ltd 

WWW: http://www.Whatman.co.uk
E-mail: information@Whatman.co.uk
UK

Tel.: (+44 1622) 674 821/674 823

Fax: (+44 1622) 682 288


Appendix IV Fluorochromes and
filter sets for FISH




Table IV.1 Fluorochromes commonly used for FISH, and fluorescent stains used for chromosome banding and
identification.

Fluorochrome                                        Absorption wavelength (nm)   Emission wavelength (nm)
Aminomethyl coumarin acetic acid (AMCA)             345                          440
Cy5ᵃ                                                650                          674
Diamidino-2-phenylindole-dihydrochloride (DAPI)ᵇ    359                          461
Fluorescein isothiocyanate (FITC)ᵃ                  495                          525
Hoechst 33258ᵇ                                      365                          480
Tetramethyl rhodamine isothiocyanate (TRITC)ᵃ       543                          570
Texas red (TR)ᵃ                                     596                          620

ᵃFluorochromes commonly used for FISH.

ᵇFluorescent stains.

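When choosing a combination of fluorochromes from Table IV.1, it can help to tabulate the Stokes shift (emission minus absorption) and the spacing between emission maxima. The Python sketch below does this with the values from the table; the function names are illustrative assumptions, and the gap between emission maxima is only a rough guide, since real separation depends on the filter sets and the full spectra.

```python
# (absorption nm, emission nm) as listed in Table IV.1.
FLUOROCHROMES = {
    "AMCA": (345, 440),
    "Cy5": (650, 674),
    "DAPI": (359, 461),
    "FITC": (495, 525),
    "Hoechst 33258": (365, 480),
    "TRITC": (543, 570),
    "Texas red": (596, 620),
}


def stokes_shift(name):
    """Emission maximum minus absorption maximum, in nm."""
    absorption, emission = FLUOROCHROMES[name]
    return emission - absorption


def min_emission_gap(names):
    """Smallest distance (nm) between emission maxima in a chosen set."""
    emissions = sorted(FLUOROCHROMES[name][1] for name in names)
    if len(emissions) < 2:
        raise ValueError("need at least two fluorochromes to compare")
    return min(later - earlier for earlier, later in zip(emissions, emissions[1:]))


for name in FLUOROCHROMES:
    print(f"{name}: Stokes shift {stokes_shift(name)} nm")

combination = ["DAPI", "FITC", "Texas red"]
print(f"minimum emission gap for {combination}: {min_emission_gap(combination)} nm")
```

A dictionary keyed by fluorochrome name keeps the lookup trivial and makes it easy to add further dyes as they come into use.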
Table IV.2 Fluorescence filter sets for the Nikon Optiphot microscope.

Filter set      Excitation (nm)   Dichroic (nm)   Barrier (nm)
B-2A (FITC)     450-490           DM 510          BA 520
G-2A (TR)       510-560           DM 580          BA 590
UV-2A (DAPI)    330-380           DM 400          BA 420


Table IV.3 Zeiss filter sets used for fluorescence detection and analysis.

Fluorochrome              Exciter filter   Dichroic reflector   Barrier filter   Filter set
FITC + propidium iodide   BP 450-490       510                  LP 515           09
Texas red/rhodamine       BP 546           580                  LP 590           15
AMCA/DAPI                 G 365            395                  LP 420           02

BP, bandpass filter; LP, long-wave bandpass filter; G, solid glass filter.


Table IV.4 Fluorescence filter blocks on the MRC 600 confocal laser scanning microscope.

Filter block          Exciter filter (nm)   Dichroic reflector (nm)   Emission filter (nm)
Dual channel
A1 (TR/rhodamine)     514 DF10              DR 527LP
A2 (FITC)             540 DF30              DR 565LP
Single channel
BHS (FITC)            488 DF10              510LP                     OG 515LP
GHS (TR/rhodamine)    514 DF10              54LP                      OG 550LP


Appendix V Useful addresses and
Internet connections


(Specialist databases and other genome resource centres are listed in Chapter 37.)


V.1 Useful addresses
V.2 HUGO chromosome committees
V.3 Scientific journals, bulletin boards,
and others

This list does not aim to be complete, but 
many of the WWW addresses are home 
pages that will point to other sites of 
interest. 


V.1 Useful addresses 


American Society of Human Genetics 
9650 Rockville Pike, Bethesda, MD 
20814, USA 

Tel.: (+1 301) 571 1825 

Fax: (+1 301) 530 7079 

WWW: http://www.faseb.org/genetics/ashg/ashgmenu.htm


American Type Culture Collection 
(ATCC) 

12301 Parklawn Drive, Rockville, MD 
20852-1776, USA 

Tel.: (+1 301) 231 5585 and 881 2600 
Fax: (+1 301) 231 5826 and 770 1848 
E-mail: tech@atcc.org 

WWW: http://www.atcc.org/ 


British Council 
10 Spring Gardens, London SW1, UK 
Tel.: (+44 171) 930 8466 


Centre d’Etude du Polymorphisme
Humain (CEPH)

27 rue Juliette Dodu, F-75010 Paris,
France

Tel.: (+33 1) 4249 9862

Fax: (+33 1) 4018 0155 


CIBA Foundation 
41 Portland Place, London W1, UK 
Tel.: (+44 171) 636 9456 


Commission of the European 
Communities 

Square de Meeus 8, B-1040 Brussels, 
Belgium 


Cooperative Human Linkage Center 
(CHLC) 
WWW: http://www.chlc.org/ 


US Department of Energy (DOE) Human 
Genome Program 

WWW: http://www.er.doe.gov/production/oher/hug_top.html

Primer on Molecular Genetics

WWW: http://www.gdb.org/Dan/DOE/intro/html

White Paper on Bioinformatics

WWW: http://www.gdb.org/Dan/doe/whitepaper/contents.html


Deutsches Krebsforschungszentrum
(DKFZ) (Heidelberg, Germany)

WWW: http://genome.dkfz-heidelberg.de/


European Bioinformatics Institute 
Hinxton Hall, Hinxton, Cambridge CB10 
1RQ, UK 

WWW: http://www.ebi.ac.uk/


European Collection of Animal Cell 
Cultures (ECACC) 

Biologics Division, PHLS CAMR, Porton 
Down, Salisbury, Wilts SP4 0JG, UK 

Tel.: (+44 1980) 610391 

Fax: (+44 1980) 611315 


European Molecular Biology Laboratory
(EMBL)

Postfach 10.2209, Meyerhofstrasse 1, 
6900 Heidelberg, Germany 

Tel.: (+49 6221) 387258 

Telex: 461613 

Fax: (+49 6221) 387306 


European Federation of Biotechnology 
Cambridge Biomedical Consultants, 
Schuutstraat 12, NL 2517 XE Den Haag, 
The Netherlands 

Tel.: (+31 70) 3653857 

Fax: as phone number 


Galton Laboratory (University of 
London) 
WWW: http://diamond.gene.ucl.ac.uk


Généthon

1 rue de l’Internationale, 91000 Evry
Cedex, France

Tel.: (+33 1) 6947 2965

Fax: (+33 1) 6077 1216

WWW: http://www.genethon.fr/genethon_en.html


The Genetics Society of America 
WWW: http://www.faseb.org/genetics/gsa/gsamenu.htm


Genome Data Base (GDB) 

1830 E. Monument St., Baltimore, MD 
21205, USA 

Tel.: (+1 301) 955 9705 

Fax: (+1 301) 955 0054 

WWW: http://gdbwww.gdb.org/ 
Queries to GDB/OMIM: For e-mail 
Query Service put ‘help’ in the body of 
an e-mail message to mailserv@gdb.org 


Howard Hughes Medical
Institute/Human Gene Mapping Library
25 Science Park, Suite 457, New Haven, 
CT 06511, USA 

Fax: (+1 203) 786 5534 


Imperial Cancer Research Fund (ICRF) 
PO Box 123, Lincoln’s Inn Fields, London 
WC2A 3PX, UK 

Tel.: (+44 171) 242 0200 

Fax: (+44 171) 269 3469 

WWW: http://www.icnet.uk


NIGMS Human Genetic Mutant Cell 
Repository 

Coriell Cell Repositories, 401 Haddon 
Avenue, Camden, NJ 08103, USA 

Tel.: 800 752 3805 (in USA); (+1 609) 757 
4848 (elsewhere) 

Fax: (609) 757 9737 (in USA); (+1 609) 964 
0254 (elsewhere) 

WWW: http://arginine.umdnj.edu/cer/ccr.html


Human Genome Management
Information System (HGMIS)
WWW: http://www.ornl.gov/TechResources/HumanGenome/home.html


The Human Genome Mapping Project 
Resource Centre (HGMP-RC) (UK) 
Sanger Centre, Hinxton Hall, Hinxton, 
Cambridge CB10 1RQ, UK 

WWW: http://www.hgmp.mrc.ac.uk


Human Genome Organization Europe 
(HUGO) 

179 Great Portland Street, 5th Floor, 
London W1N 5TB, UK

Tel.: (+44 171) 436 7178 

Fax: (+44 171) 436 1988 


The Institute for Genomic Research 
(TIGR) 

9712 Medical Center Drive, Rockville, 
MD 20850, USA 

WWW: http://www.tigr.org/


Lawrence Berkeley Laboratory (Human 
Genome Center) 

MS1-213 Lawrence Berkeley Laboratory, 
1 Cyclotron Road, Berkeley, CA 94720, 
USA 

Tel.: (+1 415) 486 6800 

Fax: (+1 415) 486 5717 

WWW: http://www-hgc.lbl.gov/GenomeHome.html
Resource for molecular cytogenetics
WWW: http://rmc-www.lbl.gov


Lawrence Livermore National 
Laboratory (Human Genome Project) 
University of California, PO Box 5507, 
Livermore, CA 94550, USA 

Tel.: (+1 415) 422 5698 

Fax: (+1 415) 423 3608 

Library information 

WWW: http://www-bio.llnl.gov/bbrp/genome.html


Los Alamos National Laboratory (Center 
for Human Genome Studies) 

Los Alamos National Laboratory, Los 
Alamos, NM 87545, USA 

Tel.: (+1 505) 667 2746 

WWW: http://www-t10.lanl.gov/


Medical Research Council (UK) 

20 Park Crescent, London W1N 4AL, UK 
Tel: (+44 171) 636 5422 

Fax: (+44 171) 436 6179 

WWW: http://www.mrc.ac.uk 


MRC Mouse Genome Centre 
Harwell, Oxfordshire OX11 0RD, UK


National Center for Biotechnology 
Information (NCBI) (USA) 
WWW: http://www.ncbi.nlm.nih.gov/


National Center for Human Genome 
Research (NCHGR) (USA) 

National Institutes of Health, Building 
38A, Room 605, Bethesda, MD 20892, 
USA 

Tel.: (+1 301) 496 0844 

Fax: (+1 301) 402 0837 

WWW: http://www.nchgr.nih.gov/


National Center for Genome Resources 
(NCGR) (USA) 
WWW: http://www.ncgr.org/


National Institutes of Health (NIH) 
(USA) 

Bethesda, MD 20892, USA 

WWW: http://www.nih.gov/

NIH molecular biology
WWW: http://www.nih.gov/molbio 


The National Library of Medicine (NLM) 
(USA) 
WWW: http://www.nlm.nih.gov/


National Institute for Medical Research 
(UK) 

The Ridgeway, Mill Hill, London NW7 
1AA, UK 

Tel.: (+44 181) 959 3666 

Fax: (+44 181) 906 4477 


Online Mendelian Inheritance in Man 
(OMIM) 

WWW: http://www3.ncbi.nlm.nih.gov/Omim

The on-line version of Mendelian 
Inheritance in Man. Selected tables of 
mapped human disease genes are 
reproduced with kind permission in 
Appendix VII of this book. 


Pasteur Institute 
WWW: http://www.pasteur.fr/welcome-uk.html


Reference Library Database (RLDB) 
Max-Planck-Institut für Molekulare
Genetik, Ihnestrasse 73, 14195 Berlin-
Dahlem, Germany

Tel.: (+49 30) 8413 1627

Fax: (+49 30) 8413 1395

WWW: http://rldb.rz-berlin.mpg.de/main_e.html


Sanger Centre 

Hinxton Hall, Hinxton, Cambridge 
CB10 1RQ, UK 

Tel.: (+44 1223) 834244 

Fax: (+44 1223) 494 919

WWW: http://www.sanger.ac.uk/


Wellcome Trust Centre for Human 
Genetics 

Nuffield Department of Clinical 
Medicine, Windmill Road, Headington, 
Oxford OX3 7BN, UK 

Tel.: (+44 1865) 740015 

Fax: (+44 1865) 742187 


Whitehead Institute for Biomedical 
Research/MIT Center for Genome 
Research 

Cambridge, MA 02142, USA 
WWW: http://www-genome.wi.mit.edu


Unité de Génétique Moléculaire Murine
Institut Pasteur, 28 rue du Dr Roux, 
75724 Paris Cedex 15, France 

Tel.: (+33 1) 4568 8000 

Fax: (+33 1) 4568 8656


V.2 HUGO chromosome
committees


The role of the editors on the 
chromosome committees is to approve 
new genes using names approved by the 
Nomenclature Committee. They are also 
charged with maintaining the consensus 
maps of each chromosome and sorting 
out disputes over marker order, etc. They 
are expected to maintain the quality and 
integrity of the Genome Data Base. 


Chromosome 1 

Gail A.P. Bruns, Associate Professor, 
Children’s Hospital, Medical Center, 
Genetics Division, 300 Longwood Ave., 
Boston, MA 02115, USA 

Tel.: (+1 617) 355 7575 

Fax: (+1 617) 355 7588 

E-mail: Bruns@rascal.med.harvard.edu 


Tara Cox Matise, Columbia University, 
Department of Psychiatry, Unit 58, 722 
West 168th St., New York, NY 10032, 
USA 

Tel.: (+1 212) 960 2428 

Fax: (+1 212) 568 2750 

E-mail: tara@linkage.rockefeller.edu 


Peter S. White, Philadelphia, PA, USA 
Tel.: (+1 215) 590 4856 
E-mail: white@kermit.oncol.chop.edu 


Jeffrey M. Vance 

Durham, NC, USA 

Tel.: (+1 919) 684 6274 

Fax.: (+1 919) 684 6514 

E-mail: jett@dnadoc.mc.duke.edu 


Andreas Weith, Research Institute for 
Molecular Pathology, Dr. Bohr-Gasse 7, 
1030 Vienna, Austria 

Tel.: (+43 1) 7973 0625 

Fax: (+43 1) 798 7153 

E-mail: weith@aimp.una.ac.at 


Chromosome 2 

Friedhelm Hildebrandt, Freiburg, 
Germany 

Tel.: (+49 761) 270 4301 

Fax: (+49 761) 270 4481 

E-mail: hildebra@kk1200.ukl.uni- 
freiburg.de 


Mansoor Sarfarazi, Farmington, CT, USA 
Tel.: (+1 860) 679 3629 

Fax: (+1 860) 679 2451 

E-mail: msarfara@cortex.uchc.edu 


Erwin A. Schurr, Montreal, Quebec, 
Canada 

Tel.: (+1 514) 937 6011

Fax: (+1 514) 933 7146 

E-mail: erwin@igloo.epi.mcgill.ca 




Constantine Stratakis, Bethesda MD, 
USA 

Tel.: (+1 301) 496 0610 

Fax: (+1 301) 496 4686 

E-mail: stratakC@ccl.nichd.nih.gov 


Chromosome 3 

Benjamin Carritt, MRC Human 
Biochemical Genetics Unit, University 
College of London, Wolfson House, 4 
Stephenson Way, London NW1 2HE, UK 
Tel.: (+44 171) 380 7415 

Fax: (+44 171) 387 3496 

E-mail: b.carritt@mrc-hbgu.ucl.ac.uk 


Andreas Gal, Institut für Humangenetik,
MUL, Ratzenburger Allee 160, 23538
Lübeck, Germany

Tel.: (+49 451) 500 2622 

Fax: (+49 451) 500 4187 


Robert M. Gemmill, Eleanor Roosevelt 
Institute, 1899 Gaylord, Denver, CO 
80206, USA 

Tel.: (+1 303) 333 4515 

Fax: (+1 303) 333 8423 

E-mail: gemmill@loki.uchsc 


Susan L. Naylor, The University of Texas 
Health Science Center at San Antonio, 
Dept. of Cellular and Structural Biology, 
7703 Floyd Curl Drive, San Antonio, TX 
78284-7762, USA 

Tel.: (+1 210) 567 3842 

Fax: (+1 210) 567 6781 

E-mail: Naylor@uthscsa.edu 


Chromosome 4 

Michael Robert Altherr, Los 
Alamos, NM, USA 

Tel.: (+1 505) 665 6144 

Fax: (+1 505) 665 3024 

E-mail: altherr@telomere.lanl.gov 


Kenneth H. Buetow, Fox Chase Cancer 
Center, Division of Population Science, 
7701 Burholme Ave., Philadelphia, PA 
19111, USA 

Tel.: (+1 215) 728 3152

Fax: (+1 215) 728 3574 

E-mail: buetow@rudkin.rm.fcc.edu 
E-mail: jekarl@morgan.popgen.fccc.edu 


Jeffrey C. Murray, University of Iowa,
Cooperative Human Linkage Center, 431
EMRB, Iowa City, IA 52242, USA

Tel.: (+1 319) 335 6946

Fax: (+1 319) 335 6970 

E-mail: murray@uiowablue.weeg.edu 


Olaf Riess, Bochum, Germany
Tel.: (+49 234) 700 3831

Fax: (+49 234) 700 4196
E-mail: olaf.riess@rz.ruhr-uni-bochum.de


Gert-Jan B. van Ommen, University of 
Leiden, Department of Human Genetics, 
Sylvius Laboratory, PO Box 9503, 
Wassenaarseweg 72, 2300 RA Leiden, 
The Netherlands 

Tel.: (+31 71) 276 293 

Fax: (+31 71) 276 075 

E-mail: gvanomme@ruly46.Leidenuniv.nl 


Chromosome 5 

Michelle Le Beau, University of Chicago, 
Dept. of Medicine, Section of 
Hematology/Oncology, 5841 S.
Maryland Ave., Chicago, IL 60637, USA 
Tel.: (+1 312) 702 0795 

Fax: (+1 312) 702 3163 

E-mail: 
mmlebeau@mcis.bsd.uchicago.edu 


John D. McPherson, Genome Sequencing 
Center/Genetics, Washington University 
School of Medicine, 4444 Forest Park 
Blvd., 4th Floor, St. Louis, MO 63108, 
USA 

Tel.: (+1 314) 286 1841 

Fax: (+1 314) 286 1810

Tel.: (+1 714) 824 7447 (Lab) and 824 6792 
(Lab) 

E-mail: jmcphers@watson.wustl.edu 


Chromosome 6 

Stephan Beck, Hinxton, Cambs, UK 
Tel.: (+44 1223) 834 244 

Fax: (+44 1223) 494 919 

E-mail: beck@sanger.ac.uk 


R. Duncan Campbell, University of 
Oxford, MRC Immunochemistry Unit, 
Dept. of Biochemistry, South Parks Road, 
Oxford OX1 3QU, UK 

Tel.: (+44 1865) 275 349 

Fax: (+44 1865) 275 729 

E-mail: rdcampbell@molbiol.ox.ac.uk 


Howard M. Cann, Centre d’Etude du 
Polymorphisme Humain (CEPH), 27 Rue 
Juliette Dodu, 75010 Paris, France 

Tel.: (+33 1) 4249 9862 

Fax: (+33 1) 4018 0155 

E-mail: howard@cephb.fr 


Elizabeth Jazwinska, Brisbane, 
Queensland, Australia 

Tel.: (+61 7) 3362 0179 

Fax: (+61 7) 3362 0191 

E-mail: lizJ@qimr.edu.au


Jiannis Ragoussis, London, UK 
Tel.: (+44 171) 955 4438 

Fax: (+44 171) 955 4444 

E-mail: i.ragoussis@umds.ac.uk 


Andreas Ziegler, Freie Universität Berlin,
Institute for Experimental Oncology and
Transplantation Medicine, Spandauer
Damm 130, 14050 Berlin, Germany
Tel.: (+49 30) 3035 2617
Fax: (+49 30) 3035 3778 
E-mail: aziegler@ukrv.de 


Chromosome 7 

Helen R. Donis-Keller, St. Louis, 
MO, USA 

Tel.: (+1 314) 362 8629

Fax: (+1 314) 362 8630 


Karl-Heinz Grzeschik, Med. Zentrum für
Humangenetik, Bahnhofstrasse 7, Abt.
I, Allgemeine Humangenetik, 35037
Marburg, Germany

Tel.: (+49 642) 286232 

Fax: (+49 6421) 288920 

E-mail: grzeschi@mailer.uni-marburg.de 


Lap-Chee Tsui, Hospital for Sick 
Children, Dept. of Genetics, 555 
University Ave., Toronto, Ontario M5G 
1X8, Canada 

Tel.: (+1 416) 813 6015 

Fax: (+1 416) 813 4931 

E-mail: cfdata@sickkids.on.ca 


Chromosome 8 

Robin J. Leach, University of Texas 
Health Center at San Antonio, Dept. of 
Cellular and Structural Biology, 7703 
Floyd Curl Drive, San Antonio, TX 
78284-7762, USA 

Tel.: (+1 210) 567 6947 

Fax: (+1 210) 567 3803 

E-mail: Leach@UTHSCSA.edu 


Dan Wells, Houston, TX, USA 
Tel.: (+1 713) 743 2671 

Fax: (+1 713) 743 2636 

E-mail: dwells@uh.edu 


Stephen Wood, University of British 
Columbia, Department of Medical 
Genetics, 216 Wesbrook Building, 6174 
University Blvd., Vancouver BC V6T 
1Z3, Canada 

Tel.: (+1 604) 822 6830 

Fax: (+1 604) 822 5348 

E-mail: swood@unixg.ubc.ca 


Chromosome 9 

Jonathan L. Haines, Massachusetts 
General Hospital, Neurogenetics 
Laboratory, Bldg. 149, 6th Floor, 13th St., 
Charlestown, MA 02129, USA 

Tel.: (+1 617) 724 9571 

Fax: (+1 617) 726 5736 

E-mail: haines@helix.mgh.harvard.edu 


Margaret Susan Povey, University 
College London, MRC Human 
Biochemical Genetics Unit, Wolfson 
House, 4 Stephenson Way, London NW1 
2HE, UK 

Tel.: (+44 171) 380 7410 

Fax: (+44 171) 387 3496 

E-mail: sue@galton.ucl.ac.uk 




Brandon Wainwright, Centre for 
Molecular Biology and Biotechnology, 
University of Queensland, Brisbane QLD 
4072, Australia 

Tel.: (+61 7) 3654542 

Fax: (+61 7) 3717588 

E-mail: B.Wainwright@cmcb.uq.edu 


Jonathan Wolfe, University College 
London, Department of Genetics and 
Biometry, The Galton Laboratory, 
Wolfson House, 4 Stephenson Way, 
London NW1 2HE, UK 

Tel.: (+44 171) 387 7050 

Fax: (+44 171) 387 3496 

E-mail: jwolfe@genetics.ucl.ac.uk 


Chromosome 10 

Jen-i Mao, Collaborative Research 
Division, Genome Therapeutics Corp., 
Genome Sequencing Center, 100 Beaver 
St., Waltham, MA 02154, USA 

Tel.: (+1 617) 893 5007 

Fax: (+1 617) 642 0310 

E-mail: mao@genomecorp.com 


Nicholas Moschonas, IMBB-FORTH, 
P.O. Box 1527, 71110 Heraklion, Crete, 
Greece 

Tel.: (+30 81) 212 469 

Fax: (+30 81) 230 469 

E-mail: moschon@victor.imbb.forth.gr 


Nigel K. Spurr, Harlow, Essex, UK 
Tel.: (+44 1279) 622639 

Fax: (+44 1279) 622 500 

E-mail: Nigel_K_Spurr@sbphrd.com 


Adrian R.N. Tivey, GDB Editorial 
Assistant, UK Human Genome Mapping 
Project, Resource Centre, Hinxton Hall, 
Hinxton, Cambs. CB10 1RQ, UK 

Tel.: (+44 1223) 494528 

Fax: (+44 1223) 494512 

E-mail: A.Tivey@hgmp.mrc.ac.uk 


Chromosome 11 

Patrick Gaudray, LGMCH, CNRS URA 
1462, Avenue de Valombrose, 06107 Nice, 
Cedex 2, France 

Tel.: (+33) 93 37 77 95 

Fax: (+33) 93533071 

E-mail: gaudray@hermes.unice.fr 


Daniela S. Gerhard, Washington 
University School of Medicine, Dept. of 
Genetics, 4566 Scott Ave., Box 8232, St. 
Louis, MO 63110, USA 

Tel.: (+1 314) 362 2736 

Fax: (+1 314) 362 7855 

E-mail: gerhard@sequencer.wustl.edu 


Charles W. Richard, WPIC, 3811 O’Hara 
St., Room 1445, Pittsburgh, PA 15253, 
USA 


Tel.: (+1 412) 624 1730 
Fax: (+1 412) 624 1754 
E-mail: richard+@pitt.edu 


Veronica van Heyningen, Medical 
Research Council Human Genetics Unit, 
Western General Hospital, Edinburgh, 
Scotland EH4 2XU, UK 

Tel.: (+44 131) 467 8405 

Fax: (+44 131) 343 2620 

E-mail: vervan@hgu.mrc.ac.uk 


Chromosome 12 

Ian W. Craig, University of Oxford, 
Genetics Laboratory, Department of 
Genetics, South Parks Road, Oxford OX1 
3QU, UK 

Tel.: (+44 1865) 275 327 

Fax: (+44 1865) 275 318 

E-mail: craig@bioch.ox.ac.uk 

E-mail: icraig@hgmp.mrc.ac.uk 


Raju S. Kucherlapati, Dept. of Molecular 
Genetics, Albert Einstein College of 
Medicine, 1300 Morris Park Ave., Bronx, 
NY 10461, USA 

Tel.: (+1 718) 430 2069 

Fax: (+1 718) 430 8778 

E-mail: kucherla@aecom.yu.edu 


Peter Marynen, Center for Human 
Genetics, University of Leuven, Campus 
Gasthuisberg, Herestraat 49, 3000 
Leuven, Belgium 

Tel.: (+32 16) 34 5891 

Fax: (+32 16) 345997 

E-mail: 
Peter.Marynen@med.kuleuven.ac.be 


Chromosome 13 

Sarah Shaw, La Jolla, CA, USA 
Tel.: (+1 619) 646 8281 

Fax: (+1 619) 452 6653 

E-mail: sarah@sequana.com 


Dorothy Warburton, Columbia- 
Presbyterian Medical Center, Babies 
Hospital, Room BHS-B7, 3959 Broadway, 
New York, NY 10032, USA 

Tel.: (+1 212) 305 7143 

Fax: (+1 212) 305 7436 

E-mail: cuh@cuccfa.cce.columbia.edu 


Chromosome 14 

Diane W. Cox, Edmonton, Alberta, 
Canada 

Tel.: (+1 403) 492 0874 

Fax: (+1 403) 492 1998 

E-mail: diane.cox@ualberta.ca 


Torbjoern Nygaard, Columbia 
University, Dept. of Neurology, DB 3-330, 
650 W. 168th St., New York, NY 
10032, USA 

Tel.: (+1 212) 305 1553 

E-mail: tgn1@columbia.edu 
Chromosome 15 

Timothy A. Donlon, Chief, Molecular 
and Clinical Cytogenetics, Kapiolani 
Medical Center, Suite 400, 1946 Young 
St., Honolulu, HI 96826, USA 

Tel.: (+1 808) 973 8349 

Fax: (+1 808) 973 8053 

E-mail: 
Donlon@uhunix.uhcc.hawaii.edu 


Susan Malcolm, London, UK 

Tel.: (+44 171) 242 9789 

Fax: (+44 171) 404 6191 

E-mail: smalcolm@hgmp.mrc.ac.uk 


Cynthia C. Morton, Brigham and 
Women’s Hospital, Dept. of Pathology, 
75 Francis St., Boston, MA 02115, USA 
Tel.: (+1 617) 732 7980 

Fax: (+1 617) 732 6996 

E-mail: 
CCMORTONG@BICS.BWH.HARVARD.EDU 


Chromosome 16 

Anne-Marie Cleton-Jansen, Leiden, 
The Netherlands 

Tel.: (+31 71) 526 6625 

Fax: (+31 71) 524 8158 

E-mail: 
clet@pathology.medfac.leidenuniv.nl 


David Frederick Callen, Head, 
Cytogenetics Unit, The Adelaide 
Children’s Hospital, Dept. of 
Cytogenetics and Molecular Genetics, 72 
King William Road, North Adelaide, SA 
5006, Australia 

Tel.: (+61 8) 2046715 

Fax: (+61 8) 204 7342 

E-mail: 
dcallen@dcallen.mad.adelaide.edu.au 


Norman A. Doggett, Los Alamos 
National Laboratory, Life Sciences 
Division and Center for Human Genome 
Studies, Mail Stop: M888, Los Alamos, 
NM 87545, USA 

Tel.: (+1 505) 665 4007 

Fax: (+1 505) 667 7105 

E-mail: doggett@lanl.gov 


Chromosome 17 

Doron Lancet, Weizmann Institute of 
Science, Dept. of Membrane Research 
and Biophysics, 76100 Rehovot, Israel 
Tel.: (+972 8) 344 112 

Fax: (+972 8) 343 683 

E-mail: 
bmlancet@weizmann.weizmann.ac.il 


Jaime Prilusky, Bioinformatics Unit, 
Israel National Node—INN, Weizmann 
Institute of Science, PO Box 26, 76100 
Rehovot, Israel 


Tel.: (+972 8) 343 456 

Fax: (+972 8) 344 113 

E-mail: 
lsprilus@weizmann.weizmann.ac.il 


Ellen Solomon, Imperial Cancer 
Research Fund, Somatic Cell Genetics, 
PO Box 123, 44 Lincoln’s Inn Fields, 
London WC2A 3PX, UK 

Tel.: (+44 171) 269 3332 

Fax: (+44 171) 269 3469 

E-mail: e_solomon@icrf.ac.uk 


Chromosome 18 

Joan Overhauser, Thomas Jefferson 
University, Dept. of Biochemistry and 
Molecular Biology, Thomas Jefferson 
University, 233 South 10th St., 
Philadelphia, PA 19107, USA 

Tel.: (+1 215) 955 5188 

Fax: (+1 215) 923 9162 

E-mail: J_Overhauser@lac.jci.tju.edu 


Gary A. Silverman, Harvard Medical 
School of Pediatrics, Joint Program in 
Neonatology, 300 Longwood Ave., 
Enders-970, Boston, MA 02115, USA 

Tel.: (+1 617) 355 6416 

Fax: (+1 617) 355 7677 

E-mail: silverman_g@al.tch.harvard.edu 


Ad H.M. Geurts van Kessel, Catholic 
University of Nijmegen, Dept. of Human 
Genetics, Geert Grooteplein Zuid 20, 
6500 HB Nijmegen, The Netherlands 
Tel.: (+31 24) 361 4105 

Fax: (+31 24) 361 4107 

E-mail: A.GeurtsVankessel@antrg.azn.nl 


Chromosome 19 

Harvey Mohrenweiser, Lawrence 
Livermore National Laboratory, Biology 
and Biotechnology Research Program, 
7000 East Ave., Livermore, CA 94550, 
USA 

Tel.: (+1 510) 423 0534 

Fax: (+1 510) 422 2282 

E-mail: harvey@cea.llnl.gov 


Anne Olsen, Livermore, CA, USA 
Tel.: (+1 510) 423 4927 

Fax: (+1 510) 422 2282 

E-mail: olsen2@llnl.gov 


Chromosome 20 

Ingo Hansmann, Halle, Germany 
Tel.: (+49 345) 557 4291 

Fax: (+49 345) 557 4293 


Tim P. Keith, Collaborative Research, 
Inc., Dept. of Human Genetics and 
Molecular Biology, 1365 Main St., 
Waltham, MA 02154, USA 

Tel.: (+1 617) 893 5007 

Fax: (+1 617) 891 5062 

E-mail: tim.keith@genomecorp.com 


Chromosome 21 

Stylianos E. Antonarakis, University of 
Geneva School of Medicine, Medical 
Genetics CMU-9, 9 Avenue de Champel, 
1211 Geneva 4, Switzerland 

Tel.: (+41 22) 702 5707 

Fax: (+41 22) 702 5706 

E-mail: sea@medsun.unige.ch 


Jean Delabar, Paris, France 
Tel.: (+33 140) 61 5695 
Fax: (+33 140) 61 56 90 
E-mail: delabar@necker.fr 


Kathleen Gardiner, Denver, CO, USA 
Tel.: (+1 303) 333 4515 

Fax: (+1 303) 333 8423 

E-mail: gardiner@eri.uchsc.edu 


Julie Ruth Korenberg, Los Angeles, CA, 
USA 

Tel.: (+1 310) 855 7627 

Fax: (+1 310) 652 8010 

E-mail: jkorenberg@mailgate.csmc.edu 


David Patterson, Eleanor Roosevelt 
Institute, 1899 Gaylord St., Denver, CO 
80206-1210, USA 

Tel.: (+1 303) 333 4515 

Fax: (+1 303) 333 8423 

E-mail: Davepatt@eri.uchsc.edu 


Roger H. Reeves, Baltimore, MD, USA 
Tel.: (+1 410) 955 6621 

Fax: (+1 410) 955 0461 

E-mail: rreeves@welchlink.welch.jhu.edu 


Nobuyoshi Shimizu, Keio University 
School of Medicine, Dept. of Molecular 
Biology, 35 Shinanomachi, Shinjuku-ku, 
Tokyo, 160, Japan 

Tel.: (+81 3) 3353 2370 

Fax: (+81 3) 3351 2370 

E-mail: shimizu@dmb.med.keio.ac.jp 


Bruno Urbero, Service de 
Bioinformatique, UMS 825 CNRS-SC 13 
INSERM, 7 rue Guy Moquet-BP 8, 94801 
Villejuif, Cedex, France 

Tel.: (+33 1) 4559 5252 

Fax: (+33 1) 4559 5250 

E-mail: bruno@infobiogen.fr 


Christine Van Broeckhoven, 
Neurogenetics Lab, Born Bunje 
Foundation, University of Antwerp 
(UIA), Building T, room 5.35, 
Universiteits Plein 1, 2610 Antwerp, 
Belgium 

Tel.: (+32 3) 820 2601 

Fax: (+32 3) 820 2541 

E-mail: cvbroeck@reks.uia.ac.be 
E-mail: neurogen@reks.uia.ac.be 
Chromosome 22 

Kenneth H. Buetow, Fox Chase Cancer 
Center, Division of Population Science, 
7701 Burholme Ave., Philadelphia, PA 
19111, USA 

Tel.: (+1 215) 728 3152 

Fax: (+1 215) 728 3574 

E-mail: buetow@rudkin.rm.fccc.edu 


Jan Dumanski, Dept. of Clinical 
Genetics, Karolinska Hospital L-6, 104 01 
Stockholm, Sweden 

Tel.: (+46 8) 729 3922 

Fax: (+46 8) 327 734 

E-mail: Jan.Dumanski@molmed.ki.se 


Beverly S. Emanuel, The Children’s 
Hospital of Philadelphia, Dept. of 
Human Genetics and Molecular Biology, 
10th Floor, Abramson Center, 34th St. 
and Civic Center Blvd., Philadelphia, PA 
19104, USA 

Tel.: (+1 215) 590 3856 

Fax: (+1 215) 590 3764 

E-mail: beverly@mail.med.upenn.edu 


Chromosome X 

Andrea Ballabio, Telethon Institute of 
Genetics and Medicine, Via Olgettina 58, 
20132 Milano, Italy 

Tel.: (+39 2) 21560 206 

Fax: (+39 2) 21560 220 

E-mail: ballabio@tigem.it 


Anthony P. Monaco, Wellcome Trust 
Centre for Human Genetics, Windmill 
Road, Headington, Oxford OX3 7BN, UK 
Tel.: (+44 1865) 740 019 

Fax: (+44 1865) 742 186 

E-mail: anthony.monaco@well.ox.ac.uk 


David L. Nelson, Baylor College of 
Medicine, Institute for Molecular 
Genetics, One Baylor Plaza 902 E, 
Houston, TX 77030, USA 

Tel.: (+1 713) 798 3122 

Fax: (+1 713) 798 8854 

E-mail: nelson@bcm.tmc.edu 


Bruno Urbero, Service de 
Bioinformatique, UMS 825 CNRS-SC 
13 INSERM, 7 rue Guy Moquet-BP 8, 
94801 Villejuif, Cedex, France 

Tel.: (+33 1) 4559 5252 

Fax: (+33 1) 4559 5250 

E-mail: bruno@infobiogen.fr 


Chromosome Y 

Nabeel Affara, Cambridge University 
Dept. of Pathology, Tennis Court Road, 
Cambridge CB2 1QB, UK 

Tel.: (+44 1223) 333 700 

Fax: (+44 1223) 333 346 

E-mail: na@mole.bio.cam.ac.uk 


Michele Ramsay, South African Institute 
for Medical Research, Dept. of Human 
Genetics, PO Box 1038, Johannesburg 
2000, Republic of South Africa 

Tel.: (+27 11) 489 9214 

Fax: (+27 11) 489 9226 

E-mail: 058mrams@chiron.wits.ac.za 


Mitochondrial DNA 

Marie T. Lott, Emory University School 
of Medicine, 1462 Clifton Road, Room 
403C, Atlanta, GA 30322, USA 

Tel.: (+1 404) 727 3337 

Fax: (+1 404) 727 3949 

E-mail: mtlott@gmm.gen.emory.edu 


Douglas C. Wallace, Chairman, Emory 
University Medical School, Genetics and 
Molecular Medicine, 1462 Clifton Road, 
Room 446, Atlanta, GA 30322, USA 

Tel.: (+1 404) 727 5624 

Fax: (+1 404) 727 3949 

E-mail: dwallace@gmm.gen.emory.edu 


Nomenclature Committee 

Claude Boucheix, Hospital Paul Brousse, 
INSERM U-268, Avenue Paul Vaillant 
Couturier, 94800 Villejuif, France 

Tel.: (+33 49) 581 068 

Fax: (+33 49) 581 085 

E-mail: boucheix@genome.vjf.inserm.fr 
E-mail: jasmin@arthur.citi2.fr 


Phyllis J. McAlpine, University of 
Manitoba, Dept. of Human Genetics, 250 
Old Basic Sciences Building, 770 
Bannatyne Ave., Winnipeg, Manitoba 
R3E 0W3, Canada 

Tel.: (+1 204) 789 3393 

Fax: (+1 204) 786 8712 

E-mail: 
mcal@genmap.hgen.umanitoba.ca 


Joseph Nahmias, London, UK 

Tel.: (+44 171) 380 7777 

Fax: (+44 171) 387 3496 

E-mail: jnahmias@galton.ucl.ac.uk 


Margaret Susan Povey, University 
College London, MRC Human 
Biochemical Genetics Unit, Wolfson 
House, 4 Stephenson Way, London NW1 
2HE, UK 

Tel.: (+44 171) 380 7410 

Fax: (+44 171) 387 3496 

E-mail: sue@galton.ucl.ac.uk 


Thomas B. Shows, Roswell Park Cancer 
Institute, Dept. of Human Genetics, Elm 
and Carlton Streets, Buffalo, NY 14263, 
USA 

Tel.: (+1 716) 845 3108 

Fax: (+1 716) 845 8449 

E-mail: tbs@shows.med.buffalo.edu 


Hester M. Wain, London, UK 
Tel.: (+44 171) 387 3496 
Fax: (+44 171) 387 5096 
E-mail: h.wain@galton.ucl.ac.uk 


Julia A. White, University College of 
London, MRC Human Biochemical 
Genetics Unit, Wolfson House, 4 
Stephenson Way, London NW1 2HE, UK 
Tel.: (+44 171) 387 7050 

Fax: (+44 171) 387 3496 

E-mail: nome@galton.ucl.ac.uk 


V.3 Scientific journals, bulletin 
boards and other information 


The American Journal of Human Genetics 
Editor: Peter H. Byers 

The American Journal of Human 
Genetics, Dept. of Pathology, Box 357470, 
University of Washington, Seattle, WA 
98195-7470, USA 

Tel.: (+1 206) 685 9683 

E-mail: ajhg@u.washington.edu 


Annals of Human Genetics 

Editor (UK): David Hopkinson 
Department of Human Genetics, 
University College London, Wolfson 
House, 4 Stephenson Way, London NW1 
2HE, UK 


ARABIDOPSIS electronic newsgroup 
Information on all BIOSCI news groups 
and means of receiving messages can be 
obtained by anonymous ftp to net.bio.net 
in the folder pub/BIOSCI/doc or by 
sending the message ‘help’ to 
biosci@daresbury.ac.uk (Europe, Africa 
and Central Asia) or biosci@net.bio.net 
(Americas and the Pacific rim). Do not 
send subscription messages to the list 
address. 


BIOSCI/BIONET-electronic news 
anonymous ftp to: net.bio.net 
gopher http://gopher.bio.net/ 


BIOSUPPLYNET (online directory of 
15 000 products and 1400 suppliers) 
WWW: http://www.biosupplynet.com 


Cell 

Editor: Ben Lewin 

Cell, 1050 Massachusetts Avenue, 
Cambridge, MA 02138, USA 

WWW: http://www.cell.com 

The tables of contents and abstracts for 
Cell, Immunity, and Neuron. 


Cell biology laboratory manual 
http://www.gac.edu/cgi-bin/user/~cellab/phpl?index-1.html 


Current Opinion in Genetics & 
Development 

Editors: Ron Laskey and Matthew P. 
Scott 
34-42 Cleveland Street, London W1P 
6LB, UK 

Tel.: (+44 171) 580 8377 

Fax: (+44 171) 580 8428 


Cytogenetics and Cell Genetics 

Editor: Harold Klinger 

Department of Medical Genetics, Albert 
Einstein College of Medicine, 1300 
Morris Park Avenue, Bronx, New York 
NY 10461-1602, USA 


Genetic Linkage Bulletin Board 
gopher http://gopher.bio.net/11/GENETIC-LINKAGE 


Genomics 

Editor: Victor McKusick 

Editorial Office, 525 B St., Suite 1900, San 
Diego, CA 92101-4495, USA 

Tel.: (+1 619) 699 6469 

Fax: (+1 619) 699 6859 


Human Molecular Genetics 

Editors: Kay Davies and Willard Hunt 
Dept. of Genetics, BRD 731, Case 
Western Reserve University, 2109 
Adelbert Rd., Cleveland, OH 44106-4955, 
USA 


Tel.: (+1 216) 368 0199 
Fax: (+1 216) 368 3030 
E-mail: HMGJournal@po.CWRU.edu 


Hum-Molgen (news in Bioscience and 
Medicine) 

WWW: http://www.informatik.uni-rostock.de/HUM-MOLGEN/NewsGen/ 


Nature Genetics 

Editor: Kevin Davies 

1234 National Press Building, 
Washington, DC 20045, USA 
Tel.: (+1 202) 628 2513 

Fax: (+1 202) 628 1609 

E-mail: natgen@naturedc.com 


Nature 

Editor: Phil Campbell 

UK 

Porter’s South, 4 Crinan Street, London 
N1 9XW, UK 

Tel.: (+44 171) 833 4000 

Fax: (+44 171) 843 4596/7 

E-mail: nature@nature.com 

WWW: http://www.nature.com 
USA 

1234 National Press Building, 
Washington, DC 20045, USA 
Tel.: (+1 202) 737 2355 

Fax: (+1 202) 628 1609 

E-mail: nature@naturedc.com 


Science 

Editor-in-Chief: Floyd E. Bloom 

1200 New York Avenue, NW 
Washington, DC 20005, USA 

WWW: http://www.sciencemag.org 


Trends in Genetics 

Elsevier Trends Journals, 68 Hills Road, 
Cambridge CB2 1LA, UK 

Tel.: (+44 1223) 315961 

Fax: (+44 1223) 464430 


WWW Virtual Library: Biochemistry and 
Molecular Biology 

WWW: http://golgi.harvard.edu/sequences.html 


WWW Virtual Library: Biosciences 
WWW: http://golgi.harvard.edu/biopages.html 
Appendix VI Basic data on human 
and other genomes 


Table VI.1 Chromosome numbers of common species.

Species                        Chromosome number
Bacteria
  Escherichia coli             1
Yeast
  Saccharomyces cerevisiae     16
  Schizosaccharomyces pombe    3
Nematode
  Caenorhabditis elegans       6
Insects
  Drosophila melanogaster      4
Mammals
  Pig                          19
  Cat                          19
  Rabbit                       22
  Human                        23
  Sheep                        27
  Goat                         30
  Donkey                       31
  Horse                        32
  Dog                          39
Birds
  Chicken                      39*
  Duck                         40*
  Turkey                       40*
Plants
  Arabidopsis                  5
  Rice                         12
  Wheat                        21 (A, B, D genomes)

All mammalian and bird chromosome numbers are haploid.
* Including microchromosomes.


Table VI.2 Estimated sizes of the human chromosomes.

Chromosome    Length (Mb)
1             263
2             255
3             [illegible in source]
4             [illegible in source]
5             194
6             183
7             171
8             155
9             145
10            144
11            144
12            143
13            114
14            109
15            106
16            98
17            92
18            85
19            67
20            72
21            50
22            56
X             164
Y             59
Table VI.3 Sizes of the Saccharomyces cerevisiae chromosomes.

Chromosome    Size (kb)
I             250
II            835
III           360
IV            1600
V             580
VI            280
VII           1125
VIII          580
IX            450
X             780
XI            690
XII           1090 + rDNA (~2000 kb)
XIII          950
XIV           810
XV            1125
XVI           970


Table VI.4 Genome size and physical data.

Organism                          Estimated genome size (bp)
Bacteria
  Escherichia coli                4.45 x 10^6
  Bacillus megaterium             3 x 10^? (exponent illegible in source)
  Haemophilus influenzae          1.2 x 10^6
Viruses
  SV40                            5243
  Adenovirus                      36 x 10^3
  Polyoma virus                   5292
Bacteriophage
  Lambda                          48.5 x 10^3
Fungi
  Saccharomyces cerevisiae        15 x 10^6
  Schizosaccharomyces pombe       14 x 10^6
Invertebrates
  Caenorhabditis elegans          100 x 10^6
  Drosophila melanogaster         165 x 10^6
Vertebrates
  Amphibians
    Xenopus laevis                2.9 x 10^9
  Reptiles                        1.6-5.1 x 10^9
  Birds
    Chicken                       1.125 x 10^9
  Mammals
    Mouse (Mus musculus)          3.3 x 10^9
    Rat (Rattus norvegicus)       3.0 x 10^9
    Human                         3.5 x 10^9
Plants
  Arabidopsis thaliana            1 x 10^8
  Nicotiana tabacum               4.8 x 10^9
  Oryza sativa (rice)             4.3 x 10^8
Appendix VII Catalogue of mapped 
human disease genes 


Status: C, confirmed; P, provisional; I, inconsistent (results of 
different laboratories disagree); L, in limbo (e.g. inferred by 
homology, correction of defect). 


MIM#: Each entry is given a six-digit number whose first digit 
indicates the mode of inheritance of the gene involved: 1, 
dominant; 2, recessive; 3, X-linked; 4, Y-linked; 5, mitochondrial. 
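
The first-digit convention described above can be sketched in a few lines of Python. This is only an illustration: the function name `mim_inheritance` and the `"unknown"` fallback are assumptions made here, not part of the catalogue. Numbers whose first digit falls outside 1 to 5 (for example the 600xxx entries in Table VII.1) are not covered by the note, so the sketch reports them as unknown rather than guessing.

```python
# Mapping copied from the MIM# note; everything else is illustrative.
MIM_INHERITANCE = {
    "1": "dominant",
    "2": "recessive",
    "3": "X-linked",
    "4": "Y-linked",
    "5": "mitochondrial",
}


def mim_inheritance(mim_number):
    """Return the inheritance mode encoded in the first digit of a MIM number."""
    text = str(mim_number).strip()
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"expected a six-digit MIM number, got {mim_number!r}")
    # First digits other than 1-5 are not described in the note above.
    return MIM_INHERITANCE.get(text[0], "unknown")


print(mim_inheritance(172430))  # ENO1 entry in Table VII.1
print(mim_inheritance(246450))  # HMGCL entry in Table VII.1
```

For the two entries shown, the first digits are 1 and 2, so the sketch reports a dominant and a recessive entry respectively.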


Method: Methods used for mapping the locus: A, in situ 
DNA-RNA or DNA-DNA hybridization; AAS, deduced from 
amino acid sequence of protein; C, chromosome-mediated gene 
transfer; Ch, chromosomal change associated with given 
phenotype; D, deletion or dosage mapping, trisomy mapping or 
gene dosage effects; EM, exclusion mapping; F, linkage studies in 
families; Fc, one trait is a chromosomal heteromorphism or 
rearrangement; Fd, one or both linked loci are identified by a 
DNA polymorphism; H, presumed homology; HS, solution 
hybridization; L, linkage to X-chromosome; LD, linkage 
disequilibrium; M, microcell-mediated gene transfer; OT, ovarian 
teratoma (centromere mapping); Pcm, PCR of microdissected 
chromosome segments; Psh, PCR of somatic cell hybrid DNA; R, 
irradiation of cells followed by rescue through fusion with 
nonirradiated (nonhuman) cells; RE, restriction endonuclease 
techniques (e.g. fine structure mapping); REa, combined with 
somatic cell hybridization; REb, combined with chromosome 
sorting; REc, hybridization of cDNA to genomic fragment; REf, 
isolation of gene from genomic DNA; REl, isolation of gene from 
chromosome-specific genomic library; REn, neighbour analysis 
in restriction fragments; S, segregation of human cellular traits 
and human chromosomes (or segments of chromosomes) in 
particular clones from interspecies somatic cell hybrids; T, 
telomere-associated chromosome fragmentation; V, induction of 
microscopically evident chromosomal change by a virus, e.g. 
adenovirus 12 changes on chromosomes 1 and 17; X/A, 
X-autosome translocation in female with X-linked recessive 
disorder. 

Tables by kind permission of Dr V. A. McKusick. 


Table VII.1 The morbid anatomy of the human genome (by chromosome). 

Location Locus symbol Status Title MIM# Methods Disorder(s) Mouse locus 
1pter-p36.13 ENO1, PPH C Enolase-1, α 172430 S,E,R,R,REa Enolase deficiency (1) 4(Eno1) 
1pter-p33 HMGCL P 3-hydroxy-3-methyl-glutaryl 246450 REa, A HMG-CoA lyase deficiency 
Coenzyme A lyase (3) 
1p36.3 MTHFR i Methylenetetrahydrofolate 236250 A Homocystinuria due 
reductase to MTHFR deficiency (3) 
1p36.3-p36.2 PLOD P Procollagen-lysine, 2-oxoglutarate 153454 REa, A Ehlers-Danlos syndrome, 4(Plod) 
5-dioxygenase (lysine hydroxylase) type VI, 225400 (3) 
1p36.3-p34.1 C1QA C Complement component-1, 120550 REa, REb ?C1q deficiency (1) 
q subcomponent, α-polypeptide 
1p36.3-p34.1 C1QB C Complement component-1, q subcomponent, 120570 REa, REb ?C1q deficiency (1) (C1qb) 
β-polypeptide 
1p36.2-p36.1 NB, NBS Ec Neuroblastoma (neuroblastoma 256700 Ch, D Neuroblastoma (2) 
suppressor) 
1p36.2-p34 EKV (e Erythrokeratodermia variabilis 133200 R Erythrokeratodermia 
variabilis (2) 
1p36.2-p34 EPB41, EL1 C Erythrocyte surface protein band 4.1 130500 F, REb Elliptocytosis-1 (3) 4(Elp1) 
1p36.2-p34 RH@ C Rhesus blood group cluster 111700 E,D,Fd,A Erythroblastosis fetalis (1); 
?Rh-null hemolytic 
anemia (1) 
1p36.1-p34 ALPL, HOPS C Alkaline phosphatase, liver/ 171760 S, H, Fd, F, A Hypophosphatasia, 4(Akp2) 
bone/kidney infantile, 241500 (3); 
?hypophosphatasia adult, 
146300 (1) 
1p36 BRCD2 P Breast cancer, ductal 211420 Ch, FD Breast cancer, ductal (2) 
1p36 CMM,MLM, P Cutaneous malignant melanoma / 155600 FE, Fd,D Malignant melanoma, 
DNS dysplastic naevus cutaneous (2) 
1p36-p35 CMT2 P Charcot-Marie-Tooth neuropathy-2 118210 Fd Charcot-Marie-Tooth 
(hereditary motor sensory disease, type II (2) 
neuropathy II) 
1p36-p35 GALE (Ss UDP galactose-4-epimerase 230350 S,LD Galactose epimerase 
deficiency (1) 
1p35-p34.3 CSF3R (e Colony-stimulating factor-3 receptor 138971 A,REb,Psh, Kostmann neutropenia, 
(granulocyte) Rea 202700 (3) 
1p34 FUCA1 C Fucosidase, α-L-1, tissue 230000 S,E,R,A,REa Fucosidosis (3) 4(Fuca) 
1p34 HUD,PNEM P HU-antigen D (a paraneoplastic 168360 A Paraneoplastic sensory 
encephalomyelitis antigen) neuropathy (1) 
1p34 UROD C Uroporphyrinogen decarboxylase 176100 S, A, REa Porphyria cutanea tarda (3); 4(Urod) 
porphyria, hepatoerythropoietic (3) 
1p32 C8A C Complement component-8, 120950 FA, Ch, Fd C8 deficiency, type I (2) 
α-polypeptide 
1p32 C8B C Complement component-8, 120960 EA,Ch,H, Fd C8 deficiency, type II (3) 4(C8b) 
β-polypeptide 
1p32 CLN1 Cc Ceroid lipofuscinosis, neuronal-1, 256730 Fd,LD,REn — Ceroid lipofuscinosis, 
infantile neuronal-1, infantile (2) 
1p32 CPT2 Cc Carnitine palmitoyltransferase II 255120 REa, A Carnitine-palmitoyltrans- 
ferase II deficiency (3) 
1p32 DFNA2 Pp Deafness, autosomal non-syndromic 600101 Fd Deafness, autosomal non- 
sensorineural, 2 syndromic sensorineural, 2 (2) 
1p32 EDM2 P Epiphyseal dysplasia, multiple 2 600204 Fd Epiphyseal dysplasia, 
multiple 2 (2) 
1p32 TAL1, TCL5, C T-cell acute lymphocytic leukaemia-1 187040 Ch, RE Leukaemia-1, T-cell acute 4(Scl) 
SCL lymphoblastic (3) 
1p31 ACADM, i Acyl-Coenzyme A dehydrogenase, C-4 201450 REa,A Acyl-CoA dehydrogenase, 8(Acadm) 
MCAD to C-12 straight chain medium chain, 
deficiency of (3) 
1p31 DBT, BCATE2 C Dihydrolipoamide branched chain 248610 REa, A Maple syrup urine disease, 
trans-acylase (E2 component of type II (3) 
branched chain keto acid 
dehydrogenase complex) 
1p22.1-qter SDH P Succinate dehydrogenase 185470 S ?Myopathy due to succinate 
dehydrogenase 
deficiency (1) 
1p22 UOX P Urate oxidase 191540 REa, A Urate oxidase deficiency (1) 
1p22-q21 DPYD, DPD VE Dihydropyrimidine dehydrogenase 274270 REa Thymine-uraciluria (1); 
{fluorouracil toxicity 
sensitivity to} (1) 
1p22-p21 PXMP1, PMP70 P Peroxisomal membrane protein-1 170995 REa, A Zellweger syndrome-2 (3) 3(Pmp70) 
(70 kD) 
1p21 AGL, GDE P Amylo-1,6-glucosidase, 4-α-glucanotransferase 232400 REc, A Glycogen storage disease III (1) 
(glycogen debranching enzyme) 
1p21-p12 STGD,FFM  P Stargardt macular dystrophy 248200 Fd Stargardt macular 
dystrophy (2); fundus 
flavimaculatus with 
macular dystrophy (2) 
1p21-p13 AMPD1 lig Adenosine monophosphate 102770 REa, A Myoadenylate deaminase 3(Ampd1) 
deaminase-1 (muscle) deficiency (3) 
1p21-p13 CSF1,MCSF C Colony-stimulating factor-1 120420 A, REa, H ?Osteopetrosis, 259700 (1) 3(Csfm) 


(macrophage) 


1p13.1 HSD3B2 C Hydroxy-δ-5-steroid dehydrogenase, 201810 A 3-β-hydroxysteroid 3(Hsd3B2) 
3 β- and steroid δ-isomerase, type 2 [illegible] 
(adrenal, gonadal) deficiency (3) 
[location illegible] TSHB [status illegible] Thyroid-stimulating hormone, 188540 REa, RE, Fd Hypothyroidism [illegible] 3(Tshb) 
β-polypeptide 
1p11-q11 CDCD P Cardiomyopathy, familial 115200 Fd Cardiomyopathy, familial 
dilated, with [illegible] dilated, with conduction 
defect (2) 
1p11-qter EPHX1 P Epoxide hydroxylase 1, microsomal 132810 REa ?Fetal hydantoin 1(Eph1) 
(xenobiotic) syndrome (1); diphenylhy- 
dantoin toxicity (1) 
1p [symbol illegible] P Phaeochromocytoma 171300 D Phaeochromocytoma (2) 
1cen-q32 PFKM P Phosphofructokinase, muscle type 232800 S Glycogen storage 
disease VII (3) 
1q FMO4, FMO2 P Flavin-containing monooxygenase 2 136131 Psh [Fish-odor syndrome] (1) 
1q2 CAE1 C Cataract, zonular pulverulent-1 116200 F Cataract, zonular 
pulverulent-1 (2) 
1q21 FLG [status illegible] Filaggrin 135940 REa, A, REn ?Ichthyosis vulgaris, 3(Flg) 
146700 (1) 
1q21 GBA C Glucosidase, β, acid 230800 S, A, D Gaucher disease (3) 3(Gba) 
1q21 PKLR, PK1 Cc Pyruvate kinase, liver and RBC type 266200 REa,A PK deficiency haemolytic 
anaemia (3) 
1q21 RCCP1 L Renal cell carcinoma, papillary, 1 179755 Ch ?Renal cell carcinoma, 
; papillary, 1 (2) 
1q21 SPTA1 C Spectrin, α, erythrocytic-1 182860 REa, A, Fd Elliptocytosis-2 (3); 1(Spna1) 
pyropoikilocytosis (3); 
spherocytosis, recessive (3) 
1q21-q22 FY, GPD C Duffy blood group 110700 EF, Fc, Fd, A {Vivax malaria, susceptibility 
to} (1) 
1q21-q23 APCS, SAP C Amyloid P component, serum 104770 REa, A, Fd {?Amyloidosis, secondary, 1(Sap) 
susceptibility to} (1) 
1q21-q31 GLC1A, C Glaucoma 1, open angle 137760 Fd Glaucoma, primary open 
POAG, angle, juvenile-onset (2) 
GPOA 
1q22 MPZ, CMT1B C Myelin protein zero 159440 REb, A, F, Charcot-Marie-Tooth 1(Mpp) 
Fd, D neuropathy slow, nerve 
conduction type Ib, 
118200 (3); Dejerine-Sottas 
disease, myelin P(0)- 
related, 145900 (3) 
1q23-q25 CD3Z, TCRZ C CD3Z antigen, ζ-polypeptide 186780 REa, A, REn CD3, ζ-chain, deficiency (1) 1(T3z, Cd3z) 
(TiT3 complex) 
1q22-q23 TPM3,NEM1 C Tropomyosin 3 (non-muscle) 191030 REa, A Nemaline myopathy-1, 1(Tpm3) 
161800 (3) 
1q23 F5 (e Coagulation factor V (proaccelerin, 227400 REa,A,Fd, Factor V deficiency (1); 1(Cf5) 
labile factor) Ren protein C cofactor 
deficiency (3) 
1q23 FCGR3A, Ge Fc fragment of IgG, low affinity III, 146740 REb, REn Lupus erythematosus, 
CD16, receptor for (CD16) systemic, 152700 (1); 
IGFR3 neutropenia, immune (2) 
1q23 PBX1 (© Pre-B cell leukaemia transcription 176310 Ch,A Leukaemia, acute pre- 1(Pbx) 
factor-1 B-cell (2) 
1q23-q25 AT3 C Antithrombin III 107300 F, D, A, REa, Antithrombin III 1(At3) 
Fd deficiency (3) 
1q23-q25 SELE,ELAM1 C Selectin E (endothelial leukocyte 131210 REn {Atherosclerosis, 1(Elam) 
adhesion molecule-1) susceptibility to} (2) 
1q23-q25 SELP, GRMP C Selectin P (granulocyte membrane 173610 REn, A Platelet α/δ storage pool 1(Grmp) 
protein, 140 kD; antigen deficiency (1) 
CD62) 
1q25 NCF2 C Neutrophil cytosolic factor-2 (65 kD) 233710 REa, A Chronic granulomatous 1(Ncf2) 
disease due to deficiency of NCF-2 (1) 
1q25-q31 LAMC2, C Laminin, γ2 (nicein (100 kD), kalinin 150292 A, Fd Epidermolysis bullosa, 
LAMNB2, (105 kD), BM600 (100 kD)) Herlitz junctional type, 
LAMB2T 226700 (3) 
1q3 TNNT2, CMH2 C Troponin T2, cardiac 191045 REa, Fd Cardiomyopathy, familial 
hypertrophic, 2, 115195 (3) 
1q31 EBR2A P Epidermolysis bullosa 2A, junctional 226450 Fd, LD Junctional epidermolysis 
Herlitz bullosa inversa (2) 
1q31-q32.1 F13B C Coagulation factor XIII, B polypeptide 134580 Fd, A, RE Factor XIIIB deficiency (3) 1(F13b) 


1q31-q32.1 RP12 P Retinitis pigmentosa-12 (autosomal 600105 Fd Retinitis pigmentosa-12, 
recessive) autosomal recessive (2) 
1q32 CACNL1A3, C Calcium channel, L type, α-1 114208 H, REa, A, Fd Hypokalaemic periodic 1(Cchl1a3, 
CCHL1A3 polypeptide, isoform-3 (skeletal paralysis, 170400 (3) mdg) 
muscle) 
1q32 CR1, C3BR C Complement component (3b/4b) 120620 F, REa, A, RE CR1 deficiency (1); ?SLE (1) 
receptor-1 
1q32 HF1, CFH C H factor-1 (complement) 134370 F, REa, RE, H Factor H deficiency (1); 1(Cfh) 
membranoproliferative 
glomerulonephritis (1) 
1q32 MCP, CD46 ec Membrane cofactor protein (CD46, 120920 REa,A,REn {Susceptibility to measles} (1) 
trophoblast lymphocyte cross- 
reactive antigen) 
1q32 REN Cc Renin 179820 REa, A, D, Fd, [Hyperproreninaemia] (3) 1(Ren1) 
Ch 
1q32 VWS,LPS, PIT C van der Woude syndrome (lip pit 119300 Ch, Fd van der Woude syndrome (2) 
syndrome) 
1q41 RMD1 P Rippling muscle disease 1 600332 Fd Rippling muscle disease-1 (2) 
1q32 USH2A Le Usher syndrome 2A (autosomal 276901 Fd Usher syndrome, type 2 (2) 
recessive, mild) 
1q42 ADPRT,PPOL C ADP-ribosyltransferase NAD(+) 173870 REa,A ?Fanconi anaemia (1); 
?Xeroderma pigmentosum 
(1) 
1q42-q43 AGT C Angiotensinogen 106150 A, REa {Hypertension, essential, 8(Agt) 
susceptibility to} (3); 
{Pre-eclampsia, susceptibility to} (3) 
1q42.1 FH e Fumarate hydratase 136850 S,R,D, Psh Fumarase deficiency (3) 
2p25.3 D2S448, MG50 P Melanoma associated gene 600134 A ?Melanoma (1) 
2p25 POMC C Proopiomelanocortin 176830 REa ACTH deficiency (1) 12(Pomc1) 
(adrenocorticotropin/β-lipotropin) 
2p24 APOB C Apolipoprotein B (including Ag(x) 107730 REa, A Hypobetalipoproteinaemia 12(Apob) 
antigen) (3); abetalipoproteinaemia 
(3); hyperbetalipo- 
proteinaemia (3); 
apolipoprotein B-100, 
ligand-defective (3) 
2p24-p21 SPG4 Cc Spastic paraplegia-4 (autosomal 182601 Fd Spastic paraplegia-4 (2) 
dominant) 
2p23-p22 XDH Cc Xanthine dehydrogenase (xanthine 278300 REb, A Xanthinuria (1) 17(Xd) 
oxidase) 
2p21 HPE2, HPC L Holoprosencephaly-2, alobar or 157170 Ch ?Holoprosencephaly-2 (2) 
semilobar 
2p21 LHCGR P Luteinizing hormone/ 152790 A Precocious puberty, male, 
chorionogonadotropin receptor 176410 (3) 
2p21 SLC3A1, ATR1, C Solute carrier family 3 (cystine, dibasic 104614 REa, Fd, A Cystinuria, 220100 (3) 
D2H, NBAT and neutral amino acid transporters), 
member 1 
2p16-p15 COCA1, FCC1, P Colon cancer, familial, non-polyposis 120435 Fd Colon cancer, familial non- 
MSH2 type 1 polyposis, type 1 (3) 
2p16-p13 LGMD2B P Limb-girdle muscular dystrophy 2B 253601 Fd Muscular dystrophy, limb- 
(autosomal recessive) girdle, type 2B (2) 
2p13 TPO, TPX € Thyroid peroxidase 274500 REa, A, Fd Thyroid iodine peroxidase 12(Tpo) 
deficiency (1); goitre, 
congenital (3); hypothy- 
roidism, congenital (3) 
2p12 IGKC C Immunoglobulin kappa constant 147200 REa, A [Kappa light chain 6(Igkc) 
region deficiency] (3) 
2p12-p11.2 SFTP3 (e Pulmonary surfactant-associated 178640 REa, A Pulmonary alveolar 6(Sftp3) 
protein-3, 18 kD proteinosis, congenital, 
265120 (3) 
2q TBS, BCG L Mycobacterial infections, 209950 H, Fd {?Tuberculosis, suscep- 
susceptibility to tibility to} (2) 
2q12 ZAP70 GS Protein tyrosine kinase ZAP-70 (ζ-associated protein 70 kD) 176947 A Selective T-cell defect (3) 
2q13 NPH1 ( Nephronophthisis-1 (juvenile) 256100 Fd Nephronophthisis, 
juvenile (2) 
2q13-q14 PROC E€ Protein C (inactivator of coagulation 176860 REa, A Thrombophilia due to 
factors Va and VIIIa) protein C deficiency (3); 
purpura fulminans, 
neonatal (1) 
Location Locus symbol Status Title MIM# Methods Disorder(s) Mouse locus 
2q14-q21 LCO 1 Liver cancer oncogene 165320 REa, REb, A ?Hepatocellular 
carcinoma (1) 
2q21 ERCC3,XPB € Excision-repair cross-complementing 133510 S,A Xeroderma pigmentosum, 
rodent repair deficiency, group B (3) 
complementation group 3 
2q21 LCT, LAC ic Lactase 223000 REa, Fd, A, ?Lactase deficiency, con- 
Psh genital (1); lactase defici- 
ency, adult, 223100 (1) 
2q31 COL3A1 C Collagen, type III, α-1 polypeptide 120180 REa, A Ehlers-Danlos syndrome, type IV, 130050 (3); aneurysm, familial, 100070 (3); fibromuscular dysplasia of arteries, 135580 (3); Ehlers-Danlos syndrome type III (3) 1(Col3a1) 
2q31 GAD1 C Glutamate decarboxylase-1, brain 266100 REa, H, A, Psh ?Pyridoxine dependency 2(Gad1) 
(67 kD) with seizures (1) 
2q31 IDDM7 L Insulin-dependent diabetes mellitus 7 600321 H ?Diabetes mellitus, insulin- 
dependent, 7 (2) 
2q32 WSS P Wrinkly skin syndrome 278250 Ch Wrinkly skin syndrome (2) 
2q33-q34 NDUFS1 iy NADH dehydrogenase (ubiquinone), 157655 A Lactic acidosis due to defect 
Fe-S protein-1 (75 kD) in iron-sulphur cluster of 
complex I (1) 
2q33-q35 ALS2 1” Amyotrophic lateral sclerosis-2 205100 Fd Amyotrophic lateral sclerosis, 
(juvenile) juvenile (2) 
2q33-q35 CRYGA, (€ Crystallin, gamma A 123660 REa, A Cataract, Coppock-like (3) 1(Cryg1) 
CRYG1 
2q33-q36 CPS1 C Carbamoyl-phosphate synthetase 1, mitochondrial 237300 REa Carbamoylphosphate synthetase I deficiency (3) 
2q33-qter (Qa ACIDS 12 Cytochrome P450, subfamily XXVII 213700 REa Cerebrotendinous 1(Cyp27) 
(sterol 27-hydroxylase) xanthomatosis (3) 
2q34 FN1 C Fibronectin-1 135600 S, REa, A ?Ehlers-Danlos syndrome, 1(Fn1) 
type X (1) 
2q34 TCL4 P T-cell leukaemia/lymphoma-4 186860 Ch, RE Leukaemia/lymphoma, T-cell (2) 
2q34-q35 ACADL, P Acyl-Coenzyme A dehydrogenase, 201460 A ae agit ees e 
i iency 0: 
LCAD long chain OS SUS OES as) 
2q35 DES P Desmin 125660 REa, A ?Cardiomyopathy (1); ?myopathy, desminopathic (1) 1(Des) 
; 1(Sp) 
i ti -3 193500 Ch, Fd, H, Waardenburg syndrome, 
2q35 PAX3, WS1, ( Paired box homeotic gene pen seme biG) pe said 
ee syndrome, type III, 148820 
(3); rhabdomyosarcoma, 
alveolar, 268220 (3) 
i ibili 1(Nramp) 
2q35 NRAMP P Natural resistance-associated 600266 Ren k Salsa ete ae (Nramp 
macrophage protein (might to TB, etc. 
include Leishmaniasis) 
i tsynd , autosomal 
2q36 COL4A3 (G Collagen IV, a-3 Se 120070 REa, A, RE sa ste cae a 
(Goodpasture antigen ye : 
2q36-q37 AGXT,SPAT P Alanine-glyoxylate aminotransferase, 259900 A, REa “hay primary, 
asia ae wage 120131 REa, A eens syndrome, autosomal 
: a, , 
ae a Soa * ela eae recessive, 203780 (3) 
i 2(Gcg) 
138030 REa, A [?Hyperproglucagonaemia] 
2q36-q37 GCG (S Glucagon a) 
113300 Ch, D ?Brachydactyly type E (2) 
2097 sich E gies babel dae i Brachydactyly-mental 
2 hydactyly-mental retardation 600430 Ch Nf 
2937 BMDR E erat . eee! retardation syndrome (2) 
De dovaginal perineoscrotal 
; -poly- 264600 REa Pseudovaginal p 
A2 P Steroid-5-c-reductase, a-poly: saa) 
ee see peptide-2 (3-oxo-5 o-steroid 6 hypospadias ( 
a eee or?) Crigler—Najjar syndrome, 1(Ugtl) 
Ch.2 UGT1A1, 1 UDP-glucuronosyltransferase-1 191740 REa na abn ae 
GNT a syndrome, 143500 (1) 
3p26-p25 VHL C von Hippel-Lindau syndrome 193300 Fd, D, RE von Hippel-Lindau syndrome (3); renal cell carcinoma (3) 
3p25.3 PHS M Pallister-Hall syndrome 146510 Ch ?Pallister-Hall syndrome (2) 
3p25 BTD P Biotinidase 253260 A Biotinidase deficiency (1) 
3p25 XPC, XPCC iC Xeroderma pigmentosum, 278720 REa, A, RE Xeroderma pigmentosum, 
complementation group C complementation 
group C (3) 
3p24.3 THRB, C Thyroid hormone receptor, β (avian 190160 REa, A, RE, Thyroid hormone resistance, 
ERBA2, erythroblastic leukaemia viral Fd 274300, 188550 (3) 
THR1 (v-erb-a) oncogene homologue-2) 
3p24-p21 SCN5A, LQT3 C Sodium channel, voltage-gated, 600163 Fd, A Long QT syndrome-3 (3) 
type V, α polypeptide 
3p23-p22 ACAA 1p Acetyl-Coenzyme A acyltransferase 261510 REa, A Pseudo-Zellweger 
(peroxisomal 3-oxoacyl-Coenzyme syndrome (1) 
A thiolase) 
3p23-p21 SCLG1 (E Small-cell cancer of lung 182280 Ch,D Small-cell cancer of lung (2) 
3p22-p21.1 PTHR E Parathyroid hormone receptor 168468 REa, A, Psh Metaphyseal chondro- 
dysplasia, Murk Jansen 
type, 156400 (3) 
3p21.33 GLB1 C Galactosidase, β-1 230500 S, EM, A GM1-gangliosidosis (3); 9(Bgl) 
mucopolysaccharidosis 
IVB (3) 
3p21.3 COL7A1 C Collagen VII, α-1 polypeptide 120120 REa, A Epidermolysis bullosa 
dystrophica, dominant, 
131750 (3); epidermolysis 
bullosa dystrophica, 
recessive, 226600 (3) 
3p21.3 MLH1, COCA2 P mutL (E. coli) homologue 1 120436 Fd,A Colorectal cancer, familial 
non-polyposis type 2 (3); 
Turcot syndrome with 
glioblastoma, 276300 (3) 
3p21.2-p21.1 AMT P Aminomethyltransferase (glycine 238310 REa Hyperglycinaemia, non- 
cleavage system protein T) ketotic, type II (1) 
3p21.1-p12 SCA7, OPCA3 C Spinocerebellar ataxia 7 (olivopontocerebellar atrophy with retinal degeneration) 164500 Fd Cerebellar ataxia with retinal degeneration (2) 
3p14.3 TKT P Transketolase 277730 REa, A {Wernicke—Korsakoff 
syndrome, susceptibility 
to} (1) 
3p14.2 RCA1, HRCA1 C Renal carcinoma, familial, associated 1 144700 Fc, Ch Renal cell carcinoma (2) 
3p14.1-p12.3 MITF, WS2A C Microphthalmia-associated 156845 REa, A, Fd Waardenburg syndrome, 6(mi) 
transcription factor type 2A, 193510 (3) 
3p13-p12 BBS3 P Bardet-Biedl syndrome 3 600151 Fd Bardet-Biedl syndrome 3 (2) 
3p12 GBE1 P Glycogen branching enzyme 232500 REa Glycogen storage disease 
IV (1) 
3p11.1-q11.2 PROS1 C Protein S, α 176880 REa Protein S deficiency (3) 
3p11 PIT1 C Pituitary-specific transcription 173110 Fd, A Pituitary hormone deficiency, 16(Pit1, dw) 
factor Pit-1 combined (3) 
3cen-q22 MER6,RHN  P Antigen identified by monoclonal 268150 S Rh-null disease (1) 
antibody 1D8 (Rh-null, regulator 
type) 
3q11-q12 GPX1 S Glutathione peroxidase-1 138320 S, Rea Haemolytic anaemia due to 
glutathione peroxidase 
deficiency (1) 
3q12 CPO P Coproporphyrinogen oxidase 121300 REa, A Coproporphyria (3); 
harderoporphyrinuria (3) 
3q13 FIH L Hypoparathyroidism 164200 Fd Hypoparathyroidism, 
familial (2) 
3q13 UMPS, OPRT C Uridine monophosphate synthetase (orotate phosphoribosyl transferase and orotidine-5'-decarboxylase) 258900 S, A Oroticaciduria (1) 
3q13.3 DRD3 P Dopamine receptor D3 126451 REb, A {?Schizophrenia, suscep- 
tibility to} (2) 
3q2 AKU c Alkaptonuria 203500 Fd,H Alkaptonuria (2) 16(aku) 
3q21 TF C Transferrin 190000 S, H, REa, D, A Atransferrinaemia (1) 9(Trf) 
3q21-q22 PCCB C Propionyl Coenzyme A carboxylase, β-polypeptide 232050 REa, A, D Propionicacidaemia, type II or pccB type (3) 
3q21-q23 LTF C Lactotransferrin 150210 REa, A ?Lactoferrin-deficient neutrophils, 245480 (1) 9(Ltf) 
3q21-q24 CP C Ceruloplasmin 117700 F, H, REa, A [Hypoceruloplasminaemia, hereditary] (1) 9(Cp) 
3q21-q24 HHC1, FHH, PCAR1 P Hypocalciuric hypercalcaemia-1 (parathyroid Ca(2+)-sensing receptor) 145980 Fd, REa Hypercalcaemia, hypocalciuric, familial (3); neonatal hyperparathyroidism, 239200 (3); hypocalcaemia, autosomal dominant (3) 
3q21-q24 RHO, RP4 C Rhodopsin 180380 REa, A, Fd Retinitis pigmentosa-4, autosomal dominant (3); retinitis pigmentosa, autosomal recessive (3); night blindness, congenital stationary, rhodopsin-related (3) 6(Rho) 
3q21-q25 USH3 P Usher syndrome-3 276902 Fd Usher syndrome, type 3 (2) 
3q22-q23 BPES Cc Blepharophimosis, epicanthus inversus 110100 Ch, Fd Blepharophimosis, epican- 
and ptosis thus inversus and 
ptosis (2) 
3q25-q26 SI iB Sucrase-isomaltase 222900 REa, A, Fd Sucrose intolerance (1) 3(Sis) 
3q26 MDS1 P Myelodysplasia syndrome-1 600049 Ch, REc, A Myelodysplasia syndrome-1 (3) 
3q26-qter KNG Cc Kininogen 228960 Psh, A [Kininogen deficiency] (3) 
3q26.1-q26.2 BCHE,CHE1 C Butyrylcholinesterase 177400 ED,A Apnoea, postanaesthetic (3) 
3q26.3 CDL L Cornelia de Lange syndrome 122470 Ch ?Cornelia de Lange 
syndrome (2) 
3q26.3-q28 EHHADH, iP Enoyl-Coenzyme A, hydratase/ 261515 A Peroxisomal bifunctional 
PBFE 3-hydroxyacyl Coenzyme A enzyme deficiency (1) 
dehydrogenase 
3q27 BCL6 P B-cell CLL/lymphoma-6 109565 Ch,A Lymphoma, B-cell (2); lymp- 
homa, diffuse large cell (3) 
3q28-q29 HRG ic Histidine-rich glycoprotein 142640 REa, A, Fd ?Thrombophilia due to 
elevated HRG (1) 
3q28-qter OPA1 P Optic atrophy 1 (autosomal dominant) 165500 Fd Optic atrophy 1 (2) 
Chr.3 TRH P Thyrotropin-releasing hormone 275120 REa Thyrotropin-releasing 
hormone deficiency (1) 
4p16.3 FGFR3,ACH C Fibroblast growth factor receptor-3 134934 REn, Fd Achondroplasia, 100800 (3);  5(Fgfr3) 
hypochondroplasia, 
146000 (3) 
4p16.3 HD, IT15 C Huntingtin 143100 Fd Huntington's disease (3) 5(Hdh) 
4p16.3 IDUA, IDA P Iduronidase, α-L- 252800 REa, A, S Mucopolysaccharidosis Ih (3); mucopolysaccharidosis Is (3); mucopolysaccharidosis Ih/s (3) 5(Idua) 
4p16.3 PDEB, CSNB3 C Phosphodiesterase, cyclic GMP (rod receptor) β-polypeptide 180072 REa, A, Fd Night blindness, congenital stationary, type 3, 163500 (3) 5(Pdeb, rd) 
4p16.3 WHCR Cc Wolf-Hirschhorn syndrome 194190 Ch Wolf-Hirschhorn syndrome 
chromosome region (2) 
4p16.1 MSX1, HOX7 P msh (Drosophila) homeobox 142983 REa,A,D,Fd ?Wolf—Hirschhorn syn- 5(Hox7) 
homolog 1 (formerly homeobox 7) drome, 194190 (3) 
4p16-p14 CDPR L Chondrodysplasia punctata, 215100 Ch ?Chondrodysplasia punctata, 
rhizomelic rhizomelic (2) 
4p15.31 QDPR, DHPR C Quinoid dihydropteridine reductase 261630 S, A, REa, D Phenylketonuria due to 5(Qdpr) 
dihydropteridine 
reductase deficiency (3) 
4p13-q12 TAPVR1 2 Total anomalous pulmonary 106700 Fd Total anomalous pulmonary 
venous return venous return (2) 
4p WERS RB Wolfram syndrome 222300 Fd Wolfram syndrome (2) 
4q11-q13 AFP, HPAFP C α-fetoprotein 104150 H, A, Fd, F [AFP deficiency, congenital] (1); [hereditary persistence of α-fetoprotein] (3) 5(Afp) 
4q11-q13 ALB (@ Albumin 103600 F,A, REa Analbuminaemia (3); [dysal- 5(Alb1) 
buminaemic hyperthy- 
roxinemia] (3); [dysalbu- 
minaemic hyperzincaemia] 
(3) 
4q11-q13 JPD P Periodontitis, juvenile 170650 F Periodontitis, juvenile (2) 
is : ‘a PBT (e mia cpSritckertla 4 feline sarcoma 164920 REa, A, H, Piebaldism (3); 5(Kit; W) 


Ch, H, Ren Mast cell leukaemia (3) 
Dentinogenesis imperfecta-1 


(2) 
4q21 IGJ P Immunoglobulin J polypeptide, linker protein for 147790 REa, A ?Leukaemia, acute lymphocytic, with 4/11 translocation (3) 5(Igj) 
4q21-q23 GNPTA P. UDP-N-acetylglucosamine-lysosomal- 252500 Ite) D) Mucolipidosis II (1); 
enzyme N-acetylglucosamine mucolipidosis III (1) 
phosphotransferase 
4q21-q23 PKD2,PKD4 C Polycystic kidney disease-2 173910 Fd Polycystic kidney disease, 
(autosomal dominant) adult, type II (2) 
4q25 IF e I factor (complement) 217030 REa, Fd, A, RE C3b inactivator deficiency (1) 
4q25-q27 RGS c Rieger syndrome 180500 Ch, Fd Rieger syndrome (2) 
4q26-q27 IL2 C Interleukin-2 147680 REa, A, F Severe combined immunodeficiency due to IL2 deficiency (1) 3(Il2) 
4q28 FGA C Fibrinogen, α-polypeptide 134820 RE, REa, H, D, LD, A Dysfibrinogenaemia, α types (3); amyloidosis, hereditary renal, 105200 (3) 
4q28 FGB C Fibrinogen, β-polypeptide 134830 RE, REa, D, LD, A Dysfibrinogenaemia, β types (3) 
4q28 FGG C Fibrinogen, γ-polypeptide 134850 F, REa, H, RE, D, LD, A Dysfibrinogenaemia, gamma types (3); hypofibrinogenaemia, gamma types (3) 3(Fgg) 
4q28-q31 ASMD re. Anterior segment mesenchymal 107250 F Anterior segment mesen- 
dysgenesis chymal dysgenesis (2) 
4q28-q31 TYS Cc Sclerotylosis 181600 F Sclerotylosis (2) 
4q31.1 MLR, MCR Cc Mineralocorticoid receptor 264350 REa, M, A Pseudohypoaldosteronism (1) 
(aldosterone receptor) 
4q32-q33 AGA C Aspartylglucosaminidase 208400 S, F, D, A Aspartylglucosaminuria (3) 
4q32-qter ETFDH P Electron transfer flavoprotein: 231675 REa, A Glutaricacidaemia, type IIC (3) 
ubiquinone oxidoreductase 
4q32.1 HVBS6 P Hepatitis B virus integration site-6 142380 REa,A,D Hepatocellular carcinoma (3) 
4q35 F11 C Coagulation factor XI (plasma 264900 A, H, Fd Factor XI deficiency (3) 8(Cf11) 
thromboplastin antecedent) 
4q35 FSHMD1A, C Facioscapulohumeral muscular 158900 Fd Facioscapulohumeral ?8(myd) 
FSHD dystrophy 1A muscular dystrophy 1A (2) 
4q35 KLK3 P Kallikrein, plasma (Fletcher factor) 229000 A Fletcher factor deficiency (1) 8(Kal3) 
Chr.4 LAG5 P Leukocyte antigen group 5 151450 S Neutropenia, neonatal 
alloimmune (1) 
5p13 C6 Cc Complement component-6 217050 A,H,RE,Fd, C6 deficiency (1); Combined 15(C6) 
LD C6/C7 deficiency (1) 
5p13 C7 C Complement component-7 217070 A, H, RE, Fd, C7 deficiency (1) 15(C7) 
LD 
5p13 C9 C Complement component-9 120940 REa, A, Fd, C9 deficiency (1) 
LD 
5p13-p12 GHR Cc Growth hormone receptor 262500 REa, A Laron dwarfism (3) 15(Ghr) 
5p13-q12 BBBG L Hypospadias-dysphagia syndrome 145410 Ch ?Hypospadias-dysphagia 
(Opitz BBBG syndrome) syndrome (2) 
5q11-q13 ARSB (S Arylsulphatase B 253200 S Maroteaux—Lamy syndrome, 13(As1) 
several forms (3) 
5q11.2 KFS L Klippel-Feil syndrome 214300 Ch ?Klippel-Feil syndrome (2) 
5q11.2-q13.2 DHFR C Dihydrofolate reductase 126060 S, REa, H, D ?Anaemia, megaloblastic, 13(Dhfr) 
due to DHFR deficiency (1) 
5q11.2-q13.3 SCZD1 L Schizophrenia disorder-1 181510 Ch, Fd ?Schizophrenia (2) 
5q12-q32 MAR P Macrocytic anaemia, refractory 153550 Ch Macrocytic anaemia of 
5q-syndrome, refractory (2) 
5q12.2-q13.3 SMA (e Spinal muscular atrophy 253300 Fd Werdnig—Hoffmann disease 
(2); spinal muscular 
atrophy II (2); spinal 
muscular atrophy III (2) 
5q13 HEXB € Hexosaminidase B (B-polypeptide) 268800 S,Ch, D Sandhoff disease (3) 13(Hex2) 
5q13.3 RASA, GAP Cc RAS p21 protein activator (GTPase 139150 REa, A Basal cell carcinoma (3) 13(Gap) 
activating protein) 
5q21 MCC € Mutated in colorectal cancers 159350 REn, D Colorectal cancer (3) 18(Mcc) 
5q21-q22 APC,GS,FPC C Adenomatous polyposis coli 175100 D, Fd, REn Gardner syndrome (3); 18(Min, Apc) 
polyposis coli, familial (3); 
colorectal cancer (3) 
5q22-q33.3 CDGG1 P Corneal dystrophy, Groenouw type I 121900 Fd Corneal dystrophy, Groenouw type I (2); corneal dystrophy, lattice type I, 122200 (2); corneal dystrophy, combined granular/lattice type (2) 


5q22.3-q31.3 LGMD1 P Limb-girdle muscular dystrophy, autosomal dominant 159000 Fd Muscular dystrophy, limb-girdle, autosomal dominant (2) 
5q23 DTS,HBEGF C Diphtheria toxin sensitivity (heparin- 126150 S,M {Diphtheria, susceptibility to} 
binding EGF-like growth factor) (1) 
5q23-q31 FBN2,CCA P, Fibrillin-2 121050 Fd,A Contractural arachnodactyly, 18(Fbn2) 
congenital (3) 
5q23-q31 ITGA2, CD49B, BR P Integrin, α-2 (CD49B; α-2 subunit of VLA-2 receptor; platelet antigen Br) 192974 S, Psh, A Neonatal alloimmune thrombocytopenia (2); ?glycoprotein Ia deficiency (2) 
5q31 GRL c Glucocorticoid receptor, lymphocyte 138040 S, REa, Fd, Cortisol resistance (3) 18(Grl1) 
H,A,D, 
REn 
5q31-q33 DFNA1, ip Deafness, autosomal dominant-1 124900 Fd Deafness, low-tone (2) 
LFHL1 
5q31-q34 DTD P Diastrophic dysplasia 222600 Fd Diastrophic dysplasia (3) 
5q31.1 IRF1 C Interferon regulatory factor-1 147575 Fd, REa, A, REn, D Macrocytic anaemia refractory, of 5q-syndrome, 153550 (3); myelodysplastic syndrome, preleukaemic (3); myelogenous leukaemia, acute (3) 11(Irf1) 
5q31.3-q33.1 GM2A C GM2 ganglioside activator protein 272750 S, REa, Psh, A GM2-gangliosidosis, AB variant (3) 
5q32 GLRA1, STHE C Glycine receptor, alpha 1 138491 Fd, R, A Startle disease/hyperekplexia, autosomal dominant, 149400 (3); startle disease, autosomal recessive (3) 11(spd) 
5q32-q33.1 TCOF1,MFD1 C Treacher Collins—Franceschetti 154500 Ch, Fd Treacher Collins mandibulo- 
syndrome-1 facial dysostosis (2) 
5q33-qter F12, HAF ( Coagulation factor XII (Hageman factor) 234000 REa,A Factor XII deficiency (3) 
5q34-q35 MSX2, C msh (Drosophila) homeobox homologue 2 123101 Fd, REa, A Craniosynostosis, type 2 (3) 
CRS2, HOX8 
6pter-p22 SCZD3 P Schizophrenia disorder 3 600511 Fd Schizophrenia-3 (2) 
6p24.3 OFC1, CL C Orofacial cleft-1 (cleft lip with or 119530 Fd, Ch Orofacial cleft (2) 
without cleft palate; isolated cleft 
palate) 
6p25-p24 F13A1,F13A C Coagulation factor XIII, A polypeptide 134570 EFd,A,D Factor XIIIA deficiency (3) 
6p23 D6S231E, DEK P DEK gene 125264 Ch Leukaemia, acute non- 
lymphocytic (2) 
6p23 SCA1 G Spinocerebellar ataxia 1 (olivoponto- 164400 EFd,A Spinocerebellar ataxia-1 (3) 
cerebellar ataxia 1, autosomal 
dominant) 
6p22-p21.3 STL2 P, Stickler syndrome, type 2 184840 Fd Stickler syndrome, type 2 (2) 
6p22-p21 BCKDHB, E1B C Branched chain keto acid dehydro- 248611  REa,A Maple syrup urine disease, 
genase E1, B-polypeptide type 3 (3) 
6p21.3 AS, ANS P. Ankylosing spondylitis 106300 EFd Ankylosing spondylitis (2) 
6p21.3 ASD2 le Atrial septal defect, secundum type 108800 F Atrial septal defect, 
secundum type (2) 
6p21.3 C2 C Complement component-2 217000 F, LD, RE C2 deficiency (3) 17(C2) 
6p21.3 C4A, C4S C Complement component-4A 120810 F, H, RE, Fd C4 deficiency (3) 17(C4) 
6p21.3 C4B, C4F C Complement component-4B 120820 F, H, RE, Fd C4 deficiency (3) 17(C4) 
6p21.3 COL11A2 C Collagen XI, α-2 polypeptide 120290 REa, A, REn, Fd Stickler syndrome, type I, 184840 (3); OSMED syndrome, 215150 (3) 17(Col11a2) 
6p21.3 CYP21,CA21H C Cytochrome P450, subfamily XXI; 201910 FRE Adrenal hyperplasia con- 17(Cyp21) 
steroid 21-hydroxylase genital, due to 21-hydroxy- 
lase deficiency (3) 
6p21.3 DYLX2,DLX2 P Dyslexia, specific, 2 600202 Fd Dyslexia, specific, 2 (2) 
6p21.3 EJM1, JME P Epilepsy, juvenile myoclonic-1 254770 EFd Epilepsy, juvenile myoclonic 
6p21.3 GLYS1 P. Renal glucosuria-1 233100 F [Renal glucosuria] (2) 
6p21.3 HFE C Haemochromatosis 235200 LD, F Haemochromatosis (2) 
6p21.3 HLA-DPB1 ( Major histocompatibility complex, 142858 ERE {Beryllium disease, chronic, 
class II, DP B-1 susceptibility to} (3) 
6p21.3 IDDM1 1 Insulin-dependent diabetes mellitus-1 222100 ELD ?Diabetes mellitus, insulin- 
dependent-1 (2) 
6p21.3 NEU I Neuraminidase 256550 H,F ?Sialidosis (2) 17(Neu1) 
6p21.3 PDB Mi Paget disease of bone 167250 F ?Paget disease of bone (2) 


6p21.3 RP14 1? Retinitis pigmentosa-14 600132 Fd Retinitis pigmentosa-14 (2) 
(autosomal recessive) 
6p21.3 RWS ib, Ragweed sensitivity 179450 F ?Ragweed sensitivity (2) 
6p21.3 TAP2, (e Transporter-2, ABC (ATP-binding 170261 REn Bare lymphocyte syndrome, 17(Ham2) 
RING11, cassette) type I, due to TAP2 
PSF2 deficiency (1) 
6p21.3-p21.2_ LAP ie Laryngeal adductor paralysis 150270 F ?Laryngeal adductor 
paralysis (2) 
6p21.1-p12 PKHD1 (€ Polycystic kidney and hepatic disease-1 263200 Fd Polycystic kidney disease, 
ARPKD (autosomal recessive) autosomal recessive (2) 
6p21.1-pcen RDS, RP7 C Retinal degeneration, slow (peripherin) 179605 REa, A Retinitis pigmentosa, peripherin related (3); retinitis punctata albescens (3); macular dystrophy (3); retinitis pigmentosa, digenic (3); butterfly dystrophy, retinal (3) 17(rds) 
6p21 CCD G Cleidocranial dysplasia 119600 Ch, Fd, D Cleidocranial dysplasia (2) 
6p21 MUT, MCM G Methylmalonyl Coenzyme A mutase 251000 REa, A, EF D Methylmalonicaciduria, 17(Mut) 
mutase deficiency type (3) 
6p ICS1 L Immotile cilia syndrome 1 242650 F ?Immotile cilia syndrome (2) 
6p PUJO P Pelviureteric junction obstruction 143400 8 Pelviureteric junction 
obstruction (2) 
6cen-q14 STGD3 P Macular dystrophy with flecks, type 3 600110 Fd Stargardt disease 3 (2) 
6q SIASD,SLD P Sialic acid storage disease 269920 Fd Salla disease (2) 
6q13-q15 OA3, OAR L Ocular albinism, autosomal recessive 203310 Ch ?Ocular albinism, autosomal 
recessive (2) 
6q14-q16.2 MCDR1 12 Macular dystrophy, retinal, 1 136550 Fd Macular dystrophy, 
(North Carolina type) North Carolina type (2) 
6q21-q22.3 COL10A1 iG Collagen, type X, a-1 polypeptide 120110 REa, A Metaphyseal chondrody- 10(Col10a1) 
splasia, Schmid type (3) 
6q22-q23 LAMA2, € Laminin, o-2 (merosin) 156225 REa, A, Fd Muscular dystrophy, con- 10(dy, 
LAMM genital, merosin-negative Lamm) 
(2) 
6q23 ARG1 P Arginase, liver 207800 REa Argininemia (3) 
6q25-q26 RCD1 J Retinal cone dystrophy-1 180020 Ch ?Retinal cone dystrophy-1 (2) 
6q25.1 ESR C Estrogen receptor 133430 REa, A Breast cancer (1); oestrogen resistance (3) 10(Esr) 
6q26 PLG (S Plasminogen 173350 REa,A,LD,F Plasminogen Tochigi disease 17(Plg) 
(3); dysplasminogenaemic 
thrombophilia (1); plas- 
minogen deficiency, types 
I and II (1) 
6q26-q27 OVCS P Ovarian cancer, serous 167000 D Ovarian cancer, serous (2) 
6q27 LPA c Apolipoprotein Lp(a) 152200 REa,A,F,Fd {Coronary artery disease, 
susceptibility to} (1) 
Chr.6 PBCA ig Pancreatic B-cell, agenesis of 600089 D ?Diabetes mellitus, insulin- 
dependent, neonatal (2) 
7p21.3-p21.2 CRS,CSO Cc Craniosynostosis, type I 123100 Ch Craniosynostosis, type 1 (2) 
7p21 ACS3, SCS Cc Acrocephalosyndactyly-3 (Saethre— 101400 Fd,Ch Saethre—Chotzen syndrome 
Chotzen syndrome) (2) 
7p21-p15 MDDC P Macular dystrophy, dominant cystoid 153880 Fd Macular dystrophy, dominant (2) 
7p15.1-p13 RP9 P Retinitis pigmentosa-9 180104 Fd Retinitis pigmentosa-9 (2) 
7p15-p14 GHRHR C Growth hormone releasing hormone receptor 139191 REa, A ?Growth hormone deficient dwarfism (1) 6(Lit, Ghrhr) 
7p15-p13 GCK P Glucokinase (hexokinase-4) 138079 Psh, Fd MODY, type II, 125851 (3) 
7p14 AQP1, Cc Aquaporin 1 (channel-forming integral 107776 REa, A, Fd Colton blood group (3) 
CHIP28, protein, 28 kD) 
CO 
7p13 GLI3 G GLI-Kruppel family member GLI3 165240 REa, A Greig cephalopolysyndactyly 13(Xt) 
(oncogene GLI3) syndrome, 175700 (3) 
7p13-p12.3 PGAM2, c Phosphoglycerate mutase, muscle form 261670 REa, A Myopathy due to phospho- 
PGAMM glycerate mutase 
deficiency (3) 
7p13-p11.2 OGDH P Oxoglutarate dehydrogenase (lipoamide) 203740 REa α-ketoglutarate dehydrogenase deficiency (1) 
7p GHS L Goldenhar syndrome 141400 Ch ?Goldenhar syndrome (2) 
7cen-q11.2 ASL C Argininosuccinate lyase 207900 S, REa, A Argininosuccinicaciduria (3) 5(Asl) 
7q11-q22 CAM P Cavernous angiomatous malformations 116860 Fd Cavernous angiomatous malformations (2) 
Table VII.1 Continued. 


7q11.2 CD36 P CD36 antigen (collagen type I) 173510 A [Macrothrombocytopenia] 
(1); platelet glycoprotein 
IV deficiency (3) 
7q11.2 ELN C Elastin 130160 REa, A, F, Fd Supravalvar aortic stenosis, 5(Eln) 
185500 (3); Williams— 
Beuren syndrome, 194050 
(3) 
7q11.2-q21.3 EEC L Ectrodactyly, ectodermal dysplasia, 129900 Ch ?EEC syndrome (2) 
cleft lip/palate 
7q11.23 NCF1 P Neutrophil cytosolic factor-1 (47 kD) 233700 REa, A Chronic granulomatous 
disease due to deficiency 
of NCF-1 (3) 
7q11.23 ZWS1 C Zellweger syndrome-1 214100 Ch Zellweger syndrome-1 (2) 
7q21 EPO iG Erythropoietin 133170 REa,A,REb, ?Erythraemia (1) 5(Epo) 
Fd 
7q21-q22 CACNL2A (6 Calcium channel, L type, a-2 114204 Psh, Fd, A Malignant hyperthermia 
polypeptide susceptibility-3, 154276 (3) 
7q21.11 GUSB (e Glucuronidase, B- 253220 S,D, EM Mucopolysaccharidosis 5(Gus) 
VII (3) 
7q21.2-q21.3 SHFM1, SHFD1, SHSF1 C Split hand/foot malformation, type 1 183600 Ch Split-hand/split-foot malformation type 1 (2) 
7q21.3-q22 PLANH1, PAI1 C Plasminogen activator inhibitor, type I 173360 REa, REb, Thrombophilia due to 
Fd, A,D excessive plasminogen 
activator inhibitor (1); 
haemorrhagic diathesis due 
to PAI1 deficiency (1) 
7q22-q31.1 DRA P Down-regulated in adenoma 126650 A ?Colon cancer (1) 
7q22.1 COL1A2 C Collagen, type I, α-2 polypeptide 120160 S, REa, D, A Osteogenesis imperfecta, 6(Cola2) 
4 clinical forms, 166200, 
166210, 259420, 166220 
(3); Ehlers—Danlos 
syndrome, type VIIA2, 
130060 (3) 
7q31 CLD P Chloride diarrhoea, congenital 214700 Fd Chloride diarrhoea, con- 
genital (2) 
7q31 OBS P Obesity 164160 H, REa ?Obesity (2) 6(ob) 
7q31-q32 DLD, LAD, C Dihydrolipoamide dehydrogenase (E3 246900 REa Lipoamide dehydrogenase 
PHE3 component of pyruvate dehy- deficiency (3) 
drogenase complex, 2-oxo-glutarate 
complex) 
7q31-q34 BPGM Pp 2,3-bisphosphoglycerate mutase 222800 A Haemolytic anaemia due to 
bisphosphoglycerate 
mutase deficiency (1) 
7q31-q35 RP10 Cc Retinitis pigmentosa-10 (autosomal 180105 Fd Retinitis pigmentosa-10 (2) 
dominant) 
7q31.1-q31.3 LAMB1 Ee Laminin, B-1 150240 REa, A,Ch ?Cutis laxa, marfanoid 1(Lamb1) 
neonatal type (1) 
7q31.2 CFTR, CF C Cystic fibrosis transmembrane conductance regulator 219700 E, Fd Cystic fibrosis (3); congenital bilateral absence of vas deferens (3) 6(Cftr) 
7q31.3-q32 BCP, CBT C Blue cone pigment 190900 REa, A Colour blindness, tritan (3) 6(Bcp) 
7q32-qter TRY1 P Trypsin-1 276000 REa Trypsinogen deficiency (1) 6(Try1) 
7q34 TBXAS1 C Thromboxane A synthase 1 (platelet) 274180 A Thromboxane synthase 
deficiency (2) 
7q34-qter SLO L Smith-Lemli-Opitz syndrome 270400 Ch ?Smith-Lemli-Opitz syndrome (2) 
7q35 CLCN1 C Chloride channel-1, skeletal muscle 118425 H, REa, Fd Myotonia congenita, recessive, 255700 (3); myotonia congenita, dominant, 160800 (3) 6(adr, Clc1) 
7q35-q36 HERG, LQT2 C Long (electrocardiographic) QT syndrome-2 152427 Fd, REn, A Long QT syndrome-2 (3) 
7q36 HPE3, HLP3 C Holoprosencephaly-3 142945 Ch, Fd Holoprosencephaly, type 3 
7q36 HPFH2 L Hereditary persistence of fetal haemoglobin, heterocellular, Indian type 142335 Fd ?Hereditary persistence of fetal haemoglobin, heterocellular, Indian type (2) 
7q36 TELL C Triphalangeal thumb-polysyndactyly syndrome 190605 Fd Triphalangeal thumb-polysyndactyly syndrome (2) 
Chr.7 HADHB P Hydroxyacyl-Coenzyme A dehydrogenase/3-ketoacyl-Coenzyme A thiolase/enoyl-Coenzyme A hydratase (trifunctional protein), β-subunit 143450 S 3-hydroxyacyl-CoA dehydrogenase deficiency (1) 
8pter-p22 EPMR P Epilepsy, progressive, with mental 600143 Fd Epilepsy, progressive, with 
retardation mental retardation (2) 
8p22 LPL, LIPD C Lipoprotein lipase 238600 REa, A, Fd Hyperlipoproteinaemia I (1); lipoprotein lipase deficiency (3); hyperchylomicronaemia syndrome, familial (3) 8(Lpl) 
8p21.1 GSR C Glutathione reductase 138300 S, D Haemolytic anaemia due to glutathione reductase deficiency (1) 8(Gr1) 
8p21-p12 CLU, CLI, SGP2, TRPM2 C Clusterin (complement lysis inhibitor, SP-40,40; sulphated glycoprotein 2; testosterone-repressed prostate message-2; apolipoprotein J) 185430 REa, REb, A, RE ?{Atherosclerosis, susceptibility to} (3) 14(Sgp2) 
8p21-p11.2 LHRH, GNRH P Luteinizing hormone releasing hormone (gonadotropin releasing hormone) 152760 REa, A ?Hypogonadotropic hypogonadism due to GNRH deficiency, 227200 (1) 14(Gnrh) 
8p12 PLAT, TPA (S Plasminogen activator, tissue type 173370 REa,A,REb Plasminogen activator 8(Plat) 
deficiency (1) 
8p12-p11.2 FGFR1, FLT2 C Fibroblast growth factor receptor-1 136350 REa, A Pfeiffer syndrome, 101600 (3) 
(fms-related tyrosine kinase-2) 
8p12-p11 WRN C Werner syndrome 277700 Fd Werner syndrome (2) 
8p12-q13 SPG5A P Spastic paraplegia 5A (autosomal 270800 Fd Spastic paraplegia 5A (2) 
recessive) 
8p11.2 ANK1, SPH2 Ankyrin-1, erythrocytic 182900 F,Ch,D,REa, Spherocytosis-2 (3) 8(nb) 
A, Fd, REb 
8p11-q21 RP1 R Retinitis pigmentosa-1 180100 Fd Retinitis pigmentosa-1 (2) 
8q EBN2 P Epilepsy, benign neonatal-2 (benign 121201 Fd Epilepsy, benign neonatal, 
familial neonatal convulsions) type 2 (2) 
8q TTP1, AVED C Tocopherol transfer protein, 600415 Fd, LD, REc Ataxia with isolated vitamin E deficiency, 277460 (3) 
8q11 HYRC1, DNPK1 C Hyperradiosensitivity of murine SCID mutation, complementing-1 202500 CA ?Severe combined immunodeficiency, type I (1) 16(scid) 
8q12 SGPA, PSA P Salivary gland pleomorphic adenoma 181030 Ch Salivary gland pleomorphic adenoma (2) 
8q13-q21.1 CMT4A P Charcot-Marie-Tooth neuropathy-4A (autosomal recessive) 214400 Fd Charcot-Marie-Tooth disease, type IVA (2) 
8q13.3 BOR iS Branchio-otorenal syndrome 113650 Ch, Fd Branchio—otorenal dysplasia 
(2) 
8q21 CYP11B1, P450C11 C Cytochrome P450, subfamily XIB, polypeptide-1; 11-β-hydroxylase; corticosteroid methyl-oxidase II (CMO II) 202010 REa, A, Ch Adrenal hyperplasia, congenital, due to 11-β-hydroxylase deficiency (3); aldosteronism, glucocorticoid-remediable (3) 
8q21 CYP11B2 Cc Cytochrome P450, subfamily XIB, 124080 REa CMO II deficiency (3) 
polypeptide-2 
8q21.1 PXMP3, Cc Peroxisomal membrane protein-3 170993 RE Zellweger syndrome-3 (3) 
PAF1, (35 kD) 
PMP35 
8q22 CA2 Cc Carbonic anhydrase II 259730 REa, H Renal tubular acidosis- 3(Car2) 
osteopetrosis syndrome (3) 
8q22-q23 CSH1 P Cohen syndrome 1 216550 Fd Cohen syndrome (2) 
8q24 EBS1 (e Epidermolysis bullosa simplex-1 131950 F Epidermolysis bullosa, 
(Ogna) Ogna type (2) 
8q24 PDS L Pendred syndrome 274600 Ch ?Pendred syndrome (2) 
8q24 VMD1 (G Macular dystrophy, atypical vitelliform 153840 F Macular dystrophy, atypical 
vitelliform (2) 
8q24.11-q24.13 EXT1 e Exostoses (multiple) 1 133700 Ch, Fd Exostoses, multiple, type 1 (2) 
8q24.11-q24.13 LGCR,LGS,  C Langer—Giedion syndrome chromo- 150230 Ch Langer-Giedion syndrome (2) 
TRPS2 some region 
8q24.12 TRPS1 P Trichorhinophalangeal syndrome, type I 190350 Ch Trichorhinophalangeal syndrome, type I (2) 
8q24.12-q24.13 MYC C Avian myelocytomatosis viral (v-myc) oncogene homologue 190080 REa, A Burkitt lymphoma (3) 15(Myc) 


8q24.2-q24.3 TG C Thyroglobulin 188450 A, REa, REb Hypothyroidism, hereditary congenital (3); goitre, adolescent multinodular (1); goitre, non-endemic, simple (3) 15(Tgn; cog) 
Chr.8 RTS L Rothmund-Thomson syndrome 268400 Ch ?Rothmund-Thomson 
syndrome (2) 
9p24 EAAC1 P High-affinity glutamate transporter 133550 REa, A ?Dicarboxylicaminoaciduria, 
EAAC1 222730 (1) 
9p24 OVC Pp. Oncogene OVC (ovarian adeno- 164759 Ch Ovarian carcinoma (2) 
carcinoma oncogene) 
9p23 TYRP, CAS2 ec Tyrosinase-related protein 1 115501 Psh, REa, A Albinism, brown, 203290 (1) 4(b;trp1) 
9p22 GLDC, Cc Glycine dehydrogenase 238300 Ch,A Hyperglycinemia, isolated 
HYGNI1, (decarboxylating; glycine non-ketotic, type I (3) 
GCsP decarboxylase, glycine cleavage 
system protein P) 
9p22-p21 LALL P Lymphomatous acute lymphoblastic 247640 Ch Leukaemia, acute lym- 
leukaemia phoblastic (2) 
9p21 CDKN2, P Cyclin-dependent kinase inhibitor 2 600160 RE,D Melanoma (1) 
MTS1, P16 (p16, inhibits CDK4) 
9p21 MLM, Cc Melanoma 155601 D, Fd Melanoma, cutaneous 
CMM2, malignant (2) 
MLM2 
9p21 IFN1@, IFNA C Interferon, type 1, cluster 147660 REa, A, RE Interferon, α, deficiency (1) 4(Ifa) 
9p21-q21 AMCD1,DA1 P Arthrogryposis multiplex congenita, 108120 Fd Distal arthrogryposis-1 (2) 
distal, type 1 
9p13 GALT C Galactose-1-phosphate 230400 S, D, F Galactosaemia (3) 4(Galt) 
uridyltransferase 
9q21 GCNT2 P Glucosaminyl (N-acetyl) transferase 2, 600429 A [Ii blood group, 110800] (1) 
I-branching enzyme 
9p13-q11 CHH P, Cartilage-hair hypoplasia 250250 Fd Cartilage-hair hypoplasia (2) 
9p11 MROS i Melkersson—Rosenthal syndrome 155900 Ch ?Melkersson—Rosenthal 
syndrome (2) 
9p VMCM (e Venous malformations, multiple 600195 Fd Venous malformations, 
cutaneous and mucosal multiple cutaneous and 
mucosal (2) 
9q13-q21.1 FRDA (€ Friedreich ataxia 229300 Fd Friedreich ataxia (2) 
9q22 ALDOB GC Aldolase B, fructose-bisphosphatase 229600 REb, REa, Fructose intolerance (3) 
A,D 
9q22 HSD17B3, iP Hydroxysteroid (17-B) dehydrogenase 3 264300 A Pseudohermaphroditism, 
EDH17B3 male, with gynaecomastia 
(3) 
9q31 ESS1 if Epithelioma, self-healing, squamous 1, 132800 Fd Epithelioma, self-healing, 
Ferguson—Smith type squamous 1, Ferguson— 
Smith type (2); ?Basal cell 
carcinoma (2) 
9q31 NBCCS, BCNS C Naevoid basal cell carcinoma syndrome 109400 Fd, D Basal cell naevus syndrome (2) 
9q31 TAL2 P T-cell acute lymphocytic leukaemia-2 186855 REa, A, RE, Leukaemia-2, T-cell acute 4(Tal2) 
Ch lymphoblastic (3) 
9q31-q33 DYS iP Dysautonomia (Riley-Day syndrome, 223900 Fd, LD Dysautonomia, familial (2) 
hereditary sensory autonomic 
neuropathy type III) 
9q31-q33 FCMD P Fukuyama type congenital muscular 253800 Fd, LD Fukuyama type congenital 
dystrophy muscular dystrophy (2); 
?Walker—Warburg syn- 
drome, 236670 (2) 
9q32 AFDN Jb Acrofacial dysostosis, Nager type 154400 Ch Acrofacial dysostosis, Nager 
type (2) 
9q32-q34 DYTI1 (@ Dystonia-1, torsion (autosomal 128100 Fd Torsion dystonia (2) 
dominant) 
9q33-qter ITO I Hypomelanosis of Ito 146150 X/A ?Hypomelanosis of Ito (2) 
9q34 ALAD G Aminolaevulinate, 6-, dehydratase 125270 E,S,A, REa Porphyria, acute hepatic (3); 4(Lv) 
{lead poisoning, sus- 
ceptibility to} (3) 
9q34 ASS Cc Argininosuccinate synthetase ees > A REa, Fd plat esr 2(Ass1) 
: ; opamine-B- 
ae cies ee eee 2(Dbh) 
9q34 GSN P Gelsolin 137350 A, REa, RE Ae Finnishtype,  2(Gsn) 
9q34 TSC1 (e Tuberous sclerosis-1 191100 FE, Fd Tuberous sclerosis-1 (2) 
9q34.1 ABL1 ce Abelson murine leukaemia viral (v-abl) 189980 REa, Ch, A Leukaemia, chronic 2(Abl) 
oncogene homolog 1 myeloid (3) 
9q34.1 AK1 € Adenylate kinase-1 103000 ES, D, Fe Hemolytic anemia due to 2(Ak1) 
adenylate kinase 
deficiency (1) 
9q34.1 C5 C Complement component-5 120900 REa, A C5 deficiency (1) 2(Hc) 
9q34.1 CRAT,CAT1 P Carnitine acetyltransferase 600184 REa ?Carnitine acetyltransferase 
deficiency (1) 
9q34.1 D9S46E, CAN P CAN gene 114350 Ch Leukaemia, acute myeloid (2) 2(Can) 
9q34.1 ENG, END, (S Endoglin 131195 A,H, Fd Hereditary haemorrhagic 2(Eng) 
HHT, ORW telangiectasia,187300 (3) 
9q34.1 EPB72 Cc Erythrocyte membrane protein 185000 REa, Ch, A ?Stomatocytosis I (1) 2(Epb7.2) 
band 7.2 (stomatin) 
9q34.1 NPS1 Cc Nail-patella syndrome 161200 FE, Fd Nail-patella syndrome (2) 
9q34.1 XPA (e Xeroderma pigmentosum, A 278700 S,A,M Xeroderma pigmentosum, 4(Xpa) 
complementation group type A (3) 
9q34.2-q34.3 COL5A1 € Collagen V, o-1 polypeptide 120215 REa,A Ehlers—Danlos syndrome, 2(Col5a1) 
type unspecified (3) 
9q34.2 NOTCH1, E Notch (Drosophila) homologue 1 190198 Ch, H,A Leukaemia, T-cell acute 2(Notch1) 
TANI (translocation-associated) lymphoblastic (2) 
10p12-q23.2 GBM Cc Glioblastoma multiforme 137800 D Glioblastoma multiforme (2) 
10q EPT P Epilepsy, partial 600512 Fd Epilepsy, partial (2) 
10q PEO P Progressive external ophthalmoplegia, 157640 Fd PEO with mitochondrial 
autosomal dominant, with multiple DNA deletions (2) 
mitochondrial DNA deletions 
10q11.2 RET,MEN2A C RET transforming sequence; 164761 A,REn,Fd, Multiple endocrine neoplasia 
oncogene RET Ch,D IIA, 171400 (3); medullary 
thyroid carcinoma, 155240 
(3); multiple endocrine 
neoplasia IIB, 162300 (3); 
Hirschsprung disease, 
142623 (3) 
10q11.2-q21 MBL e Mannose-binding lectin, soluble 154545 REa, A, Fd {Chronic infections, due 14(Mbl1) 
(opsonic defect) to opsonin defect} (3) 
10q11 ERCC6,CKN2 P Excision repair cross complementing 133540 A Cockayne syndrome-2, 
rodent repair deficiency, comple- late onset, 216410 (2) 
mentation group 6 
10q11-q12 D10S170, TST1, C DNA segment, single copy, probe pH4 188550 REa, A Thyroid papillary carci- 
PIG Tre (transforming sequence, thyroid-1, noma (1) 
from papillary thyroid carcinoma) 
10q21-q22 PSAP, SAP1 C Prosaposin (sphingolipid activator 176801 S, REa, A, D Metachromatic 10(Psap) 
protein-1) leukodystrophy due to deficiency 
of SAP-1 (3); Gaucher 
disease, variant form (3) 
10q22 DCOH ¢C Dimerization cofactor of hepatic 126090 REa,H,A Hyperphenylalaninemia 10(Dcoh) 
nuclear factor 10, (TCF1), due to pterin-4a-carbino- 
lamine dehydratase 
deficiency, 264070 (3) 
10q22 HK1 C Hexokinase-1 142600 S, D, A, REa Haemolytic anaemia due to 10(Hk1) 
hexokinase deficiency (1) 
10q23-q24 RBP4 (& Retinol-binding protein-4, interstitial 180250 REa, A ?Retinol binding protein, 19(Rbp4) 
deficiency of (1) 
10q24 HOX11, TCL3 C Homeobox-11 (T-cell leukaemia-3 186770 Ch Leukaemia, T-cell acute 
associated breakpoint, homologous lymphocytic (2) 
to Drosophila Notch) 
10q24-q25 LIPA C Lipase A, lysosomal acid, cholesterol 278000 S, H Wolman disease (3); 19(Lip1) 
esterase cholesteryl ester storage 
disease (3) 
10q24.1-q24.3 CYP2C, C Cytochrome P450, subfamily IIC; 124020 REa, A Mephenytoin poor 19(P4502c) 
CYP2C19 mephenytoin 4'-hydroxylase metabolizer (3) 
10q24.3 BPAG2 (e Bullous pemphigoid antigen-2 113811 A,H Generalized atrophic benign 19(Bpag2) 
epidermolysis bullosa, 
226650 (1) 
10q24.3 CYP17, C Cytochrome P450, subfamily XVII; 202110 REa, H, A Adrenal hyperplasia, 19(Cyp17) 
P450C17 steroid 17-α-hydroxylase congenital, due to 17-α- 
hydroxylase deficiency (3) 
10q25 PAX2 G Paired box homeotic gene-2 167409 REa, A Optic nerve coloboma with 19(Pax2) 
renal anomalies, 120330 (3) 
10q25.2-q26.3 UROS P Uroporphyrinogen III synthase 263700 REa, Psh Porphyria, congenital 7(Uros) 
erythropoietic (3) 
10q26 FGFR2,BEK, C Fibroblast growth factor receptor-2 176943 A, Psh, Fd Crouzon craniofacial dysos- 7(Fgfr2) 
CFD1, JWS (bacteria-expressed kinase) tosis, 123500 (3); Jackson— 
Weiss syndrome, 123150 
(3); Apert syndrome, 
101200 (3); Pfeiffer syn- 
drome, 101600 (3) 
10q26 OAT G Ornithine aminotransferase 258870 S,REa,A,Fd Gyrate atrophy of choroid 7(Oat) 
and retina with ornithine- 
mia, B6 responsive or 
unresponsive (3) 
10q26.1 PNLIP P Pancreatic lipase 246600 REa, A Pancreatic lipase deficiency 
(1) 
11pter-p15.4 BWS, WBS C Beckwith-Wiedemann syndrome 130650 Ch, Fd Beckwith-Wiedemann 
syndrome (2) 
11pter-p13 AMPD3 P Adenosine monophosphate deaminase 102772 REa [AMP deaminase deficiency, 
3 (isoform E) erythrocytic] (3) 
11p15.5 IDDM2 P. Insulin-dependent diabetes mellitus-2 125852 Fd Insulin-dependent diabetes 
mellitus-2 (2) 
11p15.5 HBB C Haemoglobin β 141900 LD, AAS, Sickle cell anaemia (3); 7(Hbb) 
EF Fd thalassaemias, β- (3); 
methaemoglobinaemias, β- (3); 
erythraemias, β- (3); Heinz 
body anaemias, β- (3); 
HPFH, deletion type (3) 
11p15.5 HBGR C Haemoglobin, γ, regulator of 142270 RE ?Hereditary persistence of 
fetal haemoglobin (3) 
11p15.5 HBG1 C Haemoglobin, γA 142200 RE HPFH, non-deletion type 
A (3) 
11p15.5 HBG2 C Haemoglobin, γG 142250 RE HPFH, non-deletion type 
G (3) 
11p15.5 INS & Insulin 176730 HS, A, REb, Diabetes mellitus, rare 6(Ins1); 
Fd,D form (1); MODY one form  7(Ins2) 
(3); hyperproinsulinaemia, 
familial (3) 
11p15.5 LQT1 P Long (electrocardiographic) QT 192500 Fd Long QT syndrome-1 (2) 
syndrome-1; Ward—Romano 
syndrome 
11p15.5 MTACR1, C Multiple tumour associated 194071 D Wilms’ tumour, type 2 (2); 
WT2 chromosome region-1 adrenocortical carcinoma, 
hereditary, 202300 (2) 
11p15.5 RMS1 RP. Rhabdomyosarcoma, embryonal 268210 D Rhabdomyosarcoma (2) 
11p15.5 TH, TYH C Tyrosine hydroxylase 191290 REa, A, Fd, Segawa syndrome, recessive 7(Th) 
RE (3) 
11p15.4 LDHA,LDH1 C Lactate dehydrogenase A 150000 S, D, REb, Exertional myoglobinuria 7(Ldh1) 
CA due to deficiency of 
LDH-A (3) 
11p15.4-p15.1 SMPD1,NPD P Sphingomyelin phosphodiesterase-1, 257200 REa, A Niemann—Pick disease, 7(Smpd1) 
acid lysosomal type A (3); Niemann-Pick 
disease, type B (3) 
11p15.3-p15.1 PTH G Parathyroid hormone 168450 REa, REb, Hypoparathyroidism, auto- 7(Pth) 
A, Fd somal dominant (3); 
hypoparathyroidism, 
autosomal recessive (3) 
11p15.1 USH1C C Usher syndrome-1C (autosomal 276904 Fd Usher syndrome, type 1C (2) 
recessive, severe) 
11p15.1-p14. PHHI C Persistent hyperinsulinaemic 256450 Fd, LD Persistent hyperinsulinaemic 
hypoglycaemia of infancy (nesidioblastosis) hypoglycaemia of infancy (2) 
11p15 RBTN1, C Rhombotin-1 186921 Ch, D Leukaemia, T-cell acute 7(Ttg1) 
RHOM1 lymphoblastic (2) 
11p14-p13 HVBS1 C Hepatitis B virus integration site-1 114550 REa, A, Ch Hepatocellular carcinoma (1) 
11p13 CAT C Catalase 115500 S, D, Fd Acatalasaemia (3) 2(Cas1) 
11p13 CD59 (€ CD59 antigen (p18-20) 107271 REa, A,D CD59 deficiency (3) a 15(Ly6) 
11p13 FSHB C Follicle-stimulating hormone, 136530 D, REa ?Male infertility, familial (1) 2(Fshb) 
β-polypeptide 
11p13 PAX6, AN2 C Paired box homeotic gene-6 106210 Ch, Fd Aniridia (3); Peters 2(Sey) 
anomaly (3); cataract, 
congenital, with late-onset 
corneal dystrophy (3) 
11p13 RBTNL1, P Rhombotin-like 1 180385 REa, REc Leukaemia, acute, T-cell (2) 
RHOM2, 
TTG2 
11p13 TCL2 er T-cell leukaemia/lymphoma-2 151390 Ch, RE, A, Leukaemia, acute T-cell (2) 
REa 
11p13 WT1 C Wilms’ tumour 1 194070 Ch Wilms’ tumour (3); Denys- 2(Wt1) 
Drash syndrome (3) 
11p13-q13 CMH4 P Cardiomyopathy, hypertrophic, 4 115197 Fd Cardiomyopathy, familial 
hypertrophic, 4 (2) 
11p12-p11.12 PFM L Parietal foramina 168500 Ch ?Parietal foramina (2) 
11p12-p11 ACP2 ec Acid phosphatase 2, lysosomal 171650 S, REa ?Lysosomal acid phos- 2(Acp2) 
phatase deficiency (1) 
11p11-q11 EXT2 I Exostoses (multiple) 2 133701 Fd ?Exostoses, multiple, type 2 
(2) 
11p11-q11 SCA5 @ Spinocerebellar ataxia 5 600224 Fd Spinocerebellar ataxia, type 5 
(2) 
11p11-q12 F2 C Coagulation factor II (thrombin) 176930 REa, A Hypoprothrombinaemia (3); 2(Cf2) 
Dysprothrombinaemia (3) 
11q CPT1 P Carnitine palmitoyltransferase I 255110 Psh, A ?Carnitine palmitoyl- 
transferase I deficiency 
(2) 
11q JBS L Jacobsen syndrome 147791 Ch ?Jacobsen syndrome (2) 
11q PC C Pyruvate carboxylase 266150 REa, H Pyruvate carboxylase 19(Pc) 
deficiency (1) 
11q11-q13.1 C1NH C Complement component-1 inhibitor 106100 REa, A Angio-oedema, hereditary (3) 
11q12-q13 IGER, APY (Cc IgE responsiveness (atopic) 147050 Fd Atopy (2) 
11q13 BBS1 P Bardet-Biedl syndrome 1 209901 Fd Bardet-Biedl syndrome 1 (2) 
11q13 CCND1, re Cyclin D1 168461 REn,R,REa, Parathyroid adenomatosis 1 
PRAD1 A (2); centrocytic lymphoma 
(2) 
11q13 IDDM4 P Insulin-dependent diabetes mellitus-4 600319 Fd Diabetes mellitus, insulin- 
dependent, 4 (2) 
11q13 MEN1 iC Multiple endocrine neoplasia, type I 131100 Fd, D Multiple endocrine neo- 
plasia I (1); prolactinoma, 
hyperparathyroidism, 
carcinoid syndrome (2) 
11q13 NDUFV1, P NADH dehydrogenase (ubiquinone) 161015 REa, A ?Mitochondrial complex I 
UQORI flavoprotein 1 (51 kD) deficiency, 252010 (1) 
11q13 PYGM Cc Phosphorylase, glycogen, muscle 232600 REb,Fd,REn McArdle disease (3) 19(Pygm) 
11q13 ROM1, ROSP1 P Rod outer segment membrane protein-1 180721 REa, A Retinitis pigmentosa, 19(Rosp1) 
digenic (3) 
11q13 RT6 iP RT6 antigen (rat) homologue 180840 REa, A ?{Susceptibility to IDDM} (1) 7(Rt6) 
11q13 SMTN P Somatotrophinoma 102200 D Somatotrophinoma (2) 
11q13 ST3 C Suppression of tumorigenicity-3 191181 S, D Cervical carcinoma (2) 
(tumour-suppressor gene, HELA 
cell type) 
11q13 VMD2 Cc Vitelliform macular dystrophy 153700 Fd, Psh Macular dystrophy, vitel- 
(Best disease) liform type (2) 
11q13 VRNI P. Vitreoretinopathy, neovascular 193235 Fd Vitreoretinopathy, neo- 
inflammatory vascular inflammatory (2) 
11q13-q23 EVR1, FEVR C Exudative vitreoretinopathy-1 (autosomal 133780 Fd Vitreoretinopathy, exudative, 
dominant; Criswick-Schepens familial (2) 
syndrome) 
11q13.3 BCL1 iG B-cell CLL/lymphoma-1 151400 RE, Ch Leukaemia/lymphoma, 
B-cell, 1 (2) 
11q13.5 DFNB2, i Deafness, autosomal recessive 600060 Fd Deafness, non-syndromic, 7(sh1) 
NSRD2 recessive, 2 (2) 
11q13.5 USH1B P Usher syndrome-1B 276903 Fd Usher syndrome, type 1B (2) 
(autosomal recessive, severe) 
11q14-q21 TYR C Tyrosinase 203100 REa, A, H, F Albinism, oculocutaneous, 7(Tyr) 
type IA (3) 
11q22-qter ANC Lb Anal canal carcinoma 105580 Ch ?Anal canal carcinoma (2) 
11q22.3 ATA, AT1 Cc Ataxia-telangiectasia 208900 Fd,C,M Ataxia-telangiectasia (2) 
(complementation groups A, C, D) 
11q22.3-q23.1 ACAT Cc Acetyl-Coenzyme A acetyltransferase 203750 A 3-ketothiolase deficiency (3) 
(acetoacetyl Coenzyme A thiolase) 
11q22.3-q23.2 PGL,CBT1 E Paraganglioma (carotid body tumours) 168000 Fd Paraganglioma (2) 
11q22.3-q23.3 PTS P 6-pyruvoyltetrahydropterin synthase 261640 A Phenylketonuria due to PTS 
deficiency (3) 
11q23 APOA1 C Apolipoprotein A-I 107680 REa, RE, ApoA-I and apoC-III 9(Apoa1) 
Fd, F,D deficiency, combined (3); 
hypertriglyceridaemia, 
one form (3); hypoal- 
phalipoproteinaemia (3); 
amyloidosis, Iowa type, 
107680.0010 (3) 
11q23 APOC3 Cc Apolipoprotein C-III 107720 REa, RE, F Hypertriglyceridaemia (3) 
11q23 BRCA3 B. Breast cancer, 11; 22 translocation 600048 Ch Breast cancer-3 (2) 
associated 
11q23 MLL, HRX, Gs Myeloid/lymphoid, or mixed-lineage 159555 Ch, RE Leukaemia, myeloid / 9(All1) 
HTRX1 leukaemia; trithorax (Drosophila) lymphoid or mixed- 
homologue lineage (2) 
11q23 TEPT IE Thrombocytopenia, Paris—Trousseau 188025 Ch ?Thrombocytopenia, 
type (deletion 11q23 syndrome) Paris—Trousseau type (2) 
11q23.1 PORC P Porphyria, acute, Chester type 176010 Fd Porphyria, Chester type (2) 
11q24.1-q24.2 HMBS, C Hydroxymethylbilane synthase 176000 S, D Porphyria, acute inter- 9(Ups) 
PBGD, UPS mittent (3) 
Chr.11 GIF Gastric intrinsic factor 261000 REa Anaemia, pernicious, 
congenital, due to deficiency 
of intrinsic factor (1) 
12pter-p12 CD4 G CD4 antigen (p55) 186940 REa, A [CD4(+) lymphocyte 6(Ly4) 
deficiency] (2); {Lupus 
erythematosus, suscep- 
tibility to} (2) 
12pter-p12 DRPLA (S Dentatorubro-pallidoluysian atrophy 125370 Fd Dentatorubro-pallidoluy- 
sian atrophy (3) 
12pter-q12 BCT1 iS Branched chain aminotransferase-1 113520 S ?Hyperleucinaemia-isoleucin- 
aemia or hypervalinaemia (1) 
12p13.3 VWF, F8VWF C Coagulation factor VIII VWF 193400 A, REa, REb, von Willebrand disease (3) 6(Vwf) 
(von Willebrand factor) Fd 
12p13.3-p12.3 A2M C α-2-macroglobulin 103950 REa, A Emphysema due to α-2- 6(A2m) 
macroglobulin deficiency 
(1) 
12p13.3-p11.2 ACLS L Acrocallosal syndrome 200990 Ch ?Acrocallosal syndrome (2) 
12p13 C1R C Complement component-1, 216950 REa, Fd, C1r/C1s deficiency, 
r subcomponent RE, A combined (1) 
12p13 C1S C Complement component-1, 120580 REa, Fd, RE, A C1r/C1s deficiency, 
s subcomponent combined (1) 
12p13 KCNA1, C Potassium voltage-gated channel, 176260 REa, Fd, A, H Episodic ataxia/myokymia 6(Kcna1) 
AEMK shaker-related subfamily, member 1 syndrome, 160120 (3) 
12p13 MPE E Malignant proliferation, eosinophil 131440 Ch ?Eosinophilic myelopro- 
liferative disorder (2) 
12p13 TPI1 C Triosephosphate isomerase-1 190450 S, D, R, REa Haemolytic anaemia due to 6(Tpi1) 
triosephosphate 
isomerase deficiency (3) 
12p12.1 KRAS2, ¢€ Kirsten rat sarcoma-2 viral (v-Ki-ras2) 190070 REa, A, Fd Colorectal adenoma (1); 6(Kras2) 
RASK2 oncogene homologue colorectal cancer (1) 
12p12.1-p11.2 PTHLH P Parathyroid hormone-like hormone 168470 REa, A ?Humoral hypercalcaemia 6(Pthlh) 
of malignancy (1) 
12q11-q13 KRT1 € Keratin-1 139350 H, REa, A Epidermolytic hyperker- 15(Krt2) 
atosis, 113800 (3); kerato- 
derma, palmoplantar, non- 
epidermolytic (3) 
12q11-q13 KRT2E 1? Keratin-2e 600194 Fd Ichthyosis bullosa of 
Siemens, 146800 (3) 
12p11-q13 KRT5 B Keratin-5 148040 A, Fd Epidermolysis bullosa 
simplex, Dowling—Meara 
type, 131760 (3); Epider- 
molysis bullosa simplex, 
Koebner type, 131900 (3); 
Epidermolysis bullosa, 
Weber—Cockayne type, 
131800 (3) 
12q11-q13 PPKB ? Palmoplantar keratoderma, Bothnia 600231 Fd Palmoplantar keratoderma, 
aie ae tales syn 
i ma 1 antigen 155740 REa, A ? = - 
12q12-q13 CD63,MLA1_ P CD63 antigen (melano gen) tao 
12q12-q14 VDR iF Vitamin D (1,25-dihydroxyvitamin D3) 277440 REa, A Rickets, vitamin D-resistant 
receptor (3); ?osteoporosis, involutional (1) 
12q13 AQP2 (ce Aquaporin 2 (collecting duct) 107777 A Diabetes insipidus nephro- 
genic, autosomal reces- 
sive (3) 
12q13.1-q13.2 DDIT3, C DNA-damage-inducible transcript-3 126337 REa, A, Ch Myxoid liposarcoma (3) 
GADD153, 
CHOP10 
12q13.11-q13.2 COL2A1 C Collagen, type II, α-1 polypeptide 120140 REa, A Stickler syndrome, type I (3); 
SED congenita (3); Kniest 
dysplasia (3); achondro- 
genesis-hypochondro- 
genesis, type II (3); 
osteoarthrosis, precocious 
(3); Wagner syndrome, type 
II (3); SMED Strudwick 
type (3) 
12q13.2-q24.1 FEOM,CFEOM P Fibrosis of the extraocular muscles, 135700 Fd Fibrosis of the extraocular 
congenital muscles, congenital (2) 
12q14 GNS, G6S P N-acetylglucosamine-6-sulphatase 252940 A, REa Sanfilippo syndrome D (1) 
12q14 PDDR,VDD1 C Pseudo-vitamin D dependency rickets 1 264700 Fd Pseudo-vitamin D depen- 
dency rickets 1 (2) 
12q14-qter PPD R 4-hydroxyphenylpyruvate dioxygenase 276710 REa Tyrosinaemia, type III (1) 
12q15 BABL,LIPO C Lipoma (breakpoint in benign lipoma) 151900 Ch Lipoma, benign (2); ?multiple 
lipomatosis (2) 
12q21 CNA2 PB. Cornea plana 2 (autosomal recessive) 217300 Fd Cornea plana congenita, 
recessive (2) 
12q21.3-q22 HOS FP. Holt-Oram syndrome 142900 Fd Holt-Oram syndrome (2) 
12q22 MGCT P Male germ cell tumour 273300 D,Ch Male germ cell tumour (2) 
12q22-q23 HAL, HSTD C Histidine ammonia-lyase (histidase) 235800 REa, A [Histidinaemia] (1) 10(Hstd) 
12q22-qter ACADS P Acyl-Coenzyme A dehydrogenase, 201470 REa Acyl-CoA dehydrogenase, 5(Bed1) 
C-2 to C-3 short chain short-chain, deficiency 
of (3) 
12q22-qter MODY3 Pp. Maturity-onset diabetes of the young, 600496 Fd Maturity-onset diabetes of 
type III the young, type III (2) 
12q22-qter NS1 1 Noonan syndrome 1 163950 Fd Noonan syndrome-1 (2) 
12q23-q24.1 DAR ( Darier disease (keratosis follicularis) 124200 Fd Darier disease (keratosis 
follicularis) (2) 
12q24 SCA2 C Spinocerebellar ataxia 2 183090 Fd Spinocerebellar atrophy II (2) 
(olivopontocerebellar ataxia 2, autosomal 
dominant) 
12q24.1 IFNG Cc Interferon, gamma 147570 REa,A Interferon, immune, defi- 10(Ifg) 
ciency (1) 
12q24.1 PAH, PKU1 € Phenylalanine hydroxylase 261600 REa, A, Fd Phenylketonuria (3); [hyper- 10(Pah) 
phenyl-alaninaemia, 
mild] (3) 
12q24.2 ALDH2 CG Aldehyde dehydrogenase-2, mito- 100650 REa, A,H Alcohol intolerance, acute 4(Aldh2) 
chondrial (3); {?fetal alcohol 
syndrome} (1) 
Chr.12 LYZ P Lysozyme 153450 REa Amyloidosis, renal, 105200 (3) 
Chr.12 MVK,MVLK_ P Mevalonate kinase 251170 REa Mevalonicaciduria (3) 
13q12 DFNB1 P Deafness, neurosensory, autosomal 220290 Fd Deafness, neurosensory, AR, 
recessive, 1 1 (2) 
13q12-q13 BRCA2 c Breast cancer 2, early onset 600185 Fd Breast cancer 2, early onset (2) 
13q12-q13 DMDA1 P Duchenne-like muscular dystrophy, 253700 Fd Muscular dystrophy, Duch- 
autosomal recessive enne-like, autosomal (2) 
13q12.2-q13_ MBS 13 Moebius syndrome 157900 Ch ?Moebius syndrome (2) 
13q14 D13S25, DBM P Disrupted in B-cell neoplasia 109543 D Leukaemia, chronic lym- 
phocytic, B-cell (2) 
13q14-q31 LESD L Letterer-Siwe disease 246400 Ch ?Letterer-Siwe disease (2) 
13q14.1 FKHR 1p Fork head (Drosophila) homologue1 136533 Ch Rhabdomyosarcoma, 
(rhabdomyosarcoma) alveolar, 268200 (3) 
13q14.1-q14.2 RB1 C Retinoblastoma-1 180200 Ch, F, Fd Retinoblastoma (3); 14(Rb1) 
osteosarcoma, 259500 (2); 
bladder cancer, 109800 (3) 
13q13.3-q21.1 ATP7B,WND C ATPase, Cu transporting, 277900 E Fd Wilson disease (3) 
B polypeptide 
13q21.1-q32_ CLN5 P Ceroid-lipofuscinosis, neuronal-5 256731 Fd Ceroid-lipofuscinosis, neu- 
ronal, variant late infantile 
form (2) 
13q22 EDNRB, C Endothelin receptor type B 131244 REa, Ch, LD Hirschsprung disease-2, 
HSCR2 600155 (3) 
13q32 PCCA C Propionyl Coenzyme A carboxylase, 232000 REa, D, A, Fd Propionic acidaemia, type I or 14(Pcca) 
α polypeptide pccA type (1) 
13q33 ERCC5,XPG C Excision-repair, complementing 133530 S,A Xeroderma pigmentosum, 
defective, in Chinese hamster, group G (3) 
number 5 
13q34 DJS 5 Dubin-Johnson syndrome 237500 LD ?Dubin-Johnson syndrome (2) 
13q34 F7 GC Coagulation factor VII 227700 D Factor VII deficiency (3) 
13q34 F10 G Coagulation factor X 227600 D,A,REa Factor X deficiency (3) 8(CF10) 
13q34 HHH L Hyperornithinaemia-hyperammonaemia- 238970 D ?HHH syndrome (2) 
homocitrullinaemia syndrome 
13q34 STGD2 BP Macular dystrophy with flecks, type 2 153900 Fd Stargardt disease 2 (2) 
Chr.13 BRCD1 P Breast cancer, ductal, suppressor-1 211410 D Breast cancer, ductal (2) 
Chr.13 CPB2 P Carboxypeptidase B2 (plasma) 212070 Psh Carboxypeptidase B 
deficiency (1) 
14q MPDI1 P Myopathy, distal 1 160500 Fd Myopathy, distal (2) 
14q SPG3 P Spastic paraplegia-3 182600 Fd Spastic paraplegia-3 (2) 
14q11.1-q13. HPE4 L Holoprosencephaly-4, semilobar 142946 Ch ?Holoprosencephaly-4 (2) 
14q11.2 ICR2, LI P Ichthyosis congenita II, non-erythro- 242300 Fd Lamellar ichthyosis, 
matous lamellar ichthyosis autosomal recessive (2) 
14q11.2 TCRA C T-cell antigen receptor, α-polypeptide 186880 H, REa, A, Leukaemia/lymphoma, 14(Tcra) 
REn T-cell (3) 
14q11.2-q13. =OPMD1 Me Oculopharyngeal muscular dystrophy-1 164300 Fd Oculopharyngeal muscular 
dystrophy-1 (2) 
14q12 MYH7,CMH1 C Myosin, heavy polypeptide-7, cardiac 160760 REa,RE,D,A Cardiomyopathy, familial 
muscle B hypertrophic, 1, 192600 (3); 
?central core disease, one 
form (3) 
14q13.1 NP C Nucleoside phosphorylase 164050 S, D Nucleoside phosphorylase 14(Np1,2) 
deficiency, immunodefi- 
ciency due to (3) 
14q21-q22 PYGL P Phosphorylase, glycogen, liver 232700 REb Glycogen storage disease 12(Pygl) 
VI (1) 
14q22-q23.2 SPTB C Spectrin, β, erythrocytic 182870 REb, E, H, Elliptocytosis-3 (3); 12(Sptb1) 
REa, A, RE Spherocytosis-1 (3) 
14q22.1-q22.2 GCH1 P GTP cyclohydrolase 1 600225 Psh,A Phenylketonuria, atypical, 
due to GCH1 deficiency, 
233910 (1); dystonia, DOPA- 
responsive, 128230 (3) 
14q23-q24 ARVD P Arrhythmogenic right ventricular 107970 Fd Arrhythmogenic right ventri- 
dysplasia cular dysplasia (2) 
14q24-qter CTAAI L Cataract, anterior polar, 1 115650 Ch ?Cataract, anterior polar, I (2) 
14q24.3 AD3 € Alzheimer disease-3 104311 Fd Alzheimer disease-3 (2) 
14q24.3-q31 MJD (e Machado-Joseph disease 109150 Fd Machado-Joseph disease (2) 
14q24.3-q32.1 GALC (e Galactosylceraminidase 245200  REa,A,H,Fd Krabbe disease (3) 12(tw) 
14q24.3-qter SCA3 (G Spinocerebellar ataxia 3 (olivopontocere- 183085 Fd Spinocerebellar ataxia-3 (2) 
bellar ataxia 3, autosomal dominant) 
14q31 TSHR Ec Thyroid-stimulating hormone receptor 275200 REa,Fd,A Hypothyroidism, non- 12(Tshr) 
goitrous, due to TSH 
resistance (3); thyroid 
adenoma, hyperfunction- 
ing (3); Graves disease, 
275000 (1); hyperthroidism, 
congenital (3) 
14q32 CKBE y Creatine kinase, ectopic expression 123270 F [Creatine kinase, brain type, 
ectopic expression of] (2) 
14q32 SIV IE Situs inversus viscerum 270100 H ?Situs inversus viscerum (2) 12(iv) 
14q32 USH1A,USH1 C Usher syndrome-1A 276900 Fd Usher syndrome, type 1A (2) 
14q32 VP, PPOX P Variegate porphyria 176200 F Porphyria variegata (2) 
14q32.1 AACT C α-1-antichymotrypsin 107280 REa, A, Fd α-1-antichymotrypsin deficiency (3); cerebrovascular 
disease, occlusive (3) 
14q32.1 CBG C Corticosteroid-binding globulin 122500 A, REn [Transcortin deficiency] (1) 
14q32.1 PCI,PLANH3 C Protein C inhibitor (plasminogen 227300 Psh,REn Protein C inhibitor deficiency 
Sige ase ig 107400 ES,A,D E ee cirrhosis (3) 12(Aat) 
nite ere - VoeAeD) mp 5 
ia ae . Teg ctac eae alec oa EM, Fd haemorrhagic diathesis due 
to ‘antithrombin’ Pitts- 
burgh (3); emphysema (3) 
14q32.1 TCL1 C T-cell lymphoma-1 186960 Ch, RE Leukaemia/lymphoma, T-cell 
(2) 
14q32.33 IGH@ C Immunoglobulin heavy chain gene REa, A ?Combined variable 12(Igh) 
cluster hypogammaglobulinaemia (1) 
14q32.33 IGHR L Immunoglobulin heavy chain regulator 144120 FE ?Hyperimmunoglobulin G1 
syndrome (2) 
Chr.14 MPS3C L Sanfilippo disease, type IIIC 252930 Ch ?Sanfilippo disease, type IIIC 
(2) 
Chr.14 RMCH Pp. Rod monochromacy 216900 Ch Rod monochromacy (2) 
15q11 PWCR,PWS C Prader-Willi syndrome chromosome 176270 Ch, D Prader-Willi syndrome (2) 
region 
15q11-q13 AHO2 L Albright hereditary osteodystrophy-2 103581 D ?Albright hereditary osteodys- 
trophy-2 (2) 
15q11-q13 ANCR c Angelman syndrome chromosome 105830 Ch,D Angelman syndrome (2) 
region 
15q11-q13 ITO E Hypomelanosis of Ito 146150 Ch ?Hypomelanosis of Ito (2) 
15q11.1 SPG6 1 Spastic paraplegia 6 600363 Fd Spastic paraplegia-6 (2) 
15q11.2-q12 OCA2,P, PED, C Oculocutaneous albinism II (pink-eye 203200 D, REa, Fd Albinism, oculocutaneous, 7(p) 
D15S12 dilution (murine) homologue) type II (3); albinism, ocular, 
autosomal recessive (3) 
15q12 SNRPN C Small nuclear ribonucleoprotein 182279 REa, D ?Prader-Willi syndrome (1) 7(Snrpn) 
polypeptide N 
15q14-q15 IVD P Isovaleryl Coenzyme A dehydrogenase 243500 REa Isovaleric acidaemia (3) 
15q15 EPB42 € Erythrocyte surface protein band 4.2 177070 A Spherocytosis, hereditary, 2(Epb4.2) 
Japanese type (3); 
?Hermansky—Pudlak 
syndrome, 203300 (1) 
15q15 SORD Ee Sorbitol dehydrogenase 182500 S,H,A, REa ?Cataract, congenital (2) 2(Sdh1) 
15q15.1-q21.1 LGMD2A ( Limb girdle muscular dystrophy 2A 253600 Fd,A Muscular dystrophy, limb- 
(autosomal recessive) girdle, type 2A (2) 
15q21 CDAN3, P Congenital dyserythropoietic anaemia, 105600 Fd Dyserythropoietic anaemia, 
CDA3 type III congenital, type III (2) 
15q21-q22 B2M C β-2-microglobulin 109700 S$) Did Haemodialysis-related 2(B2m) 
amyloidosis (1) 
15q21-q23 LIPC C Lipase, hepatic 151670 REa, A Hepatic lipase deficiency (3) 9(HI) 
15q21.1 CYP19,ARO CC Cytochrome P450, subfamily XIX 107910 REa, A, H ?Gynaecomastia, familial, 9(Cyp19) 
(aromatization of androgens) due to increased aromatase 
activity (1); virilization, 
maternal and fetal, from 
placental aromatase 
deficiency (3) 
15q21.1 FBN1, MFS1 C Fibrillin-1 134797 A, Fd Marfan syndrome, 154700 (3) 2(Fbn1) 
15q22 PML, MYL P Promyelocytic leukaemia, inducer of 102578 Ch, RE Leukaemia, acute promyelo- 
cytic (2) 
15q22 TPM1, CMH3 C Tropomyosin 1 (α) 191010 Fd Cardiomyopathy, familial 9(Tpm1) 
hypertrophic, 3, 115196 (3) 
15q22.3-q23 BBS4 P Bardet-Biedl syndrome 4 600374 Fd, LD Bardet-Biedl syndrome-4 (2) 
15q23-q24 HEXA,TSD C Hexosaminidase A (a-polypeptide) 272800 S,D,A Tay-Sachs disease (3);GM2- 9(Hexa) 
gangliosidosis, juvenile, adult 
(3); [Hex A pseudodeficiency] 
(1) 
15q23-q25 ETFA, GA2 1 Electron transfer flavoprotein, α- 231680 REa, A Glutaric aciduria, type IIA (1) 
polypeptide 
15q23-q25 FAH Cc Fumarylacetoacetase 276700 A, REa Tyrosinaemia, type I (3) 
15q26 IDDM3 P Insulin-dependent diabetes mellitus3 600318 Fd Diabetes mellitus, insulin- 
dependent, 3 (2) 
15q26.1 BLM, BS (e Bloom syndrome 210900 M,LD Bloom syndrome (2) 
Chr.15 TSK, SSS 18 Stiff skin syndrome 184900 H ?Stiff skin syndrome (2) 2(Tsk) 
Chr.15 XPF L Xeroderma pigmentosum, 278760 M ?Xeroderma pigmentosum, 
complementation group F type F (2) 
16pter-p13.3 HBA1 C Haemoglobin α-1 141800 HS Thalassaemias, α- (3); 11(Hba) 
methaemoglobinaemias, α- (3); 
erythraemias, α- (3); 
Heinz body anaemias, α- (3) 
16pter-p13.3 HBHR, ATR1 C α-thalassaemia/mental retardation 141750 Fd, RE α-thalassaemia/mental 
syndrome, type 1 retardation syndrome, 
type I (1) 
16p13.31- PKD1 C Polycystic kidney disease-1 173900 F, Fd, REn Polycystic kidney ?17(Pkd1) 
p13.12 (autosomal dominant) disease-1 (3) 
16p13.3 CATM P Cataract, congenital, with micr- 156850 Ch Cataract, congenital, 
ophthalmia microphthalmia (2) 
16p13.3 PKDTS P Polycystic kidney disease, infantile 600273 RE Polycystic kidney disease, 
severe, with tuberous sclerosis infantile severe, with 
tuberous sclerosis (3) 
16p13.3 RSTS G Rubinstein-Taybi syndrome 180849 Ch Rubinstein—Taybi syndrome 
(2) 
16p13.3 TSC2 eG Tuberous sclerosis-2 (tuberin) 191092 Fd, Ch, D, Tuberous sclerosis-2 (2) 17(Tsc2) 
REn 
16p13.3-p13.2_ CDG1 P Carbohydrate-deficient glycoprotein 212065 Fd Carbohydrate-deficient 
glycoprotein syndrome (2) 
16p13.11 SAH Pp’ SA (rat hypertension-associated) 145505 REa,A {?Hypertension, essential} (1) 
homologue 
16p13 HAGH, GLO2 C Hydroxyacyl glutathione hydrolase; 138760 S [Glyoxalase II deficiency] (1) 
glyoxalase II 
16p13 MEF, FMF Cc Mediterranean fever, familial 249100 Fd, LD Familial Mediterranean 
fever (2) 
16p12.3 SCNNI1B P Sodium channel, non-voltage-gated 1 177200 REa, FD Liddle syndrome (3) 
16p12 CLN3, BTS (Cc Ceroid-lipofuscinosis, neuronal-3, 204200 E Fd Batten disease (2) 
juvenile (Batten disease) 
16p11.2 SGLT2 P. Sodium-glucose transporter-2 182381 REa ?Renal glucosuria, 253100 (1) 
16q SCA4 J2 Spinocerebellar ataxia 4 600223 Fd Spinocerebellar ataxia, 
type 4 (2) 
16q12-q13.1 PHKB C Phosphorylase kinase, β-polypeptide 172490 REa, A ?Phosphorylase kinase 
deficiency of liver and 
muscle, 261750 (2) 
16q12.1 TBS L Townes-Brocks syndrome 107480 Ch ?Townes-Brocks syndrome 
(2) 
16q13-q22.1 CES1, SES1 P Carboxylesterase 1 (monocyte/ 114835 REa ?Monocyte carboxyesterase 8(Ces1) 
macrophage serine esterase 1) deficiency (1) 
16q21 BBS2 P Bardet-Biedl syndrome 2 209900 Fd Bardet-Biedl syndrome 2 (2) 
16q21 CETP i? Cholesteryl ester transfer protein, 118470 REa, A [CETP deficiency] (3) 
plasma 
16q22 CBFB ie Core-binding factor, B-subunit 121360 Ch Myeloid leukaemia, acute, 
M4Eo subtype (2) 
16q22-q24 ALDOA € Aldolase A, fructose-bisphosphatase 103850 REa,REb,A Aldolase A deficiency (3) 
16q22.1 CDH1, UVO C Cadherin 1 (E-cadherin, uvomorulin) 192090 REa, D, Ch Endometrial carcinoma (3); 8(Um) 
ovarian carcinoma (3) 
16q22.1 CTM C Cataract, Marner type 116800 F Cataract, Marner type (2) 
16q22.1 LCAT C Lecithin-cholesterol acyltransferase 245900 E, LD, A, REa Norum disease (3); fish-eye 8(Lcat) 
disease (3) 
16q22.1-q22.3 TAT C Tyrosine aminotransferase, cytosolic 276600 REa,A,H,D Tyrosinaemia, type II (3) 8(Tat)
16q24 APRT C Adenine phosphoribosyltransferase 102600 S, D Urolithiasis, 2,8-dihydroxyadenine (3) 8(Aprt)
16q24 CYBA C Cytochrome b-245, α-polypeptide 233690 REa, A Chronic granulomatous
disease, autosomal, due to 
deficiency of CYBA (3) 
16q24.3 GALNS, C Galactosamine (N-acetyl)-6-sulphate 253000 A,Psh Mucopolysaccharidosis IVA
MPS4A sulfatase (3) 
Chr.16 ATP2A1 P ATPase, Ca2+ transporting, fast- 108730 REa Brody myopathy (1)
twitch, 1 
Chr.16 CTH 1% Cystathionase 219500 S [Cystathioninuria] (1) 
17pter-p13 ASPA P Aspartoacylase (aminoacylase-2) 271900 A Canavan disease (3)
17pter-p12 GP1BA P Glycoprotein Ib, platelet, α-polypeptide 231200 A Bernard-Soulier syndrome
17pter-p12 PLI P α-2-plasmin inhibitor 262850 Psh Plasmin inhibitor deficiency
17p13.3 BCPR L Breast cancer-related regulator of TP53 113721 D ?Breast cancer (1)
17p13.3 MDCR, MDS C Miller-Dieker syndrome chromosome 247200 Ch,D Miller-Dieker lissencephaly 11(Mds)
region syndrome (2) 
17p13.1 TP53 C Tumour protein p53 191170 REa, A,D Colorectal cancer, 114500 (3); 11(Trp53)
Li-Fraumeni syndrome (3) 
17p12-q12 DFNB3 P Deafness, autosomal recessive 3 600316 Fd Deafness-3, neurosensory non-syndromic recessive (2)
17p11.2 PMP22, C Peripheral myelin protein-22 118220 Fd,D,A Charcot—Marie-Tooth 11(Tr) 
CMT1A neuropathy, slow nerve 
conduction type Ia (3); 
neuropathy, recurrent, 
with pressure palsies, 
162500 (3); Dejerine-Sottas 
disease, PMP22 related, 
145900 (3) 
17p11.2 SMCR C Smith-Magenis syndrome chromosome 182290 Ch Smith-Magenis syndrome (2)
region 


Location Locus symbol Status Title MIM# Methods Disorder(s) Mouse locus
17p RP13 P Retinitis pigmentosa-13 600059 Fd Retinitis pigmentosa-13 (2) 
17q PSS1 P Psoriasis susceptibility, familial, 1 177900 Fd Psoriasis susceptibility (2) 
17q11.2 NF1, VRNF, C Neurofibromatosis, type 1 (neurofibromatosis, 162200 Fd,EM,Ch,F Neurofibromatosis, type I (3);
WSS von Recklinghausen disease, Watson syndrome, 193520 (3)
Watson syndrome) 
17q11.2 SLS P Sjögren-Larsson syndrome 270200 Fd, LD Sjögren-Larsson syndrome (2)
17q11.2-q24 MHS2 P Malignant hyperthermia susceptibility 2 154275 Fd Malignant hyperthermia 
susceptibility 2 (2) 
17q12 RARA C Retinoic acid receptor, α-polypeptide 180240 A,Ch Leukaemia, acute promyelocytic (1) 11(Rara)
17q12-q21 KRT9, EPPK C Keratin-9 144200 Fd, REa Epidermolytic palmoplantar
keratoderma (3) 
17q12-q21 KRT14, EBS3, EBS4 P Keratin-14 148066 REa Epidermolysis bullosa simplex, Koebner type, 131900 (3); epidermolysis bullosa simplex, Dowling-Meara type, 131670 (3); epidermolysis bullosa simplex, Weber-Cockayne type, 131800 (3)
17q12-q21 PCHC1 P Pachyonychia congenita 1 167210 Fd Pachyonychia congenita,
(Jackson—Lawler type) Jackson—Lawler type (2) 
17q12-q21.33 ADL,DAG2, C Adhalin 600119 Psh, A Muscular dystrophy, 
LGMD2D Duchenne-like, type 2 (3) 
17q21 ACAC, ACC P Acetyl-Coenzyme A carboxylase 200350 A Acetyl-CoA carboxylase
deficiency (1) 
17q21 BRCA1 C Breast cancer-1, early onset 113705 Fd Breast cancer-1, early onset (3);
ovarian cancer, sporadic (3) 
17q21 G6PT C Glucose-6-phosphatase 232200 REa, REn Glycogen storage disease,
type 1 (3)
17q21-q22 DDPAC P Disinhibition-dementia-Parkinsonism- 600274 Fd Disinhibition-dementia- 
amyotrophy complex Parkinsonism-amyotrophy 
complex (2) 
17q21-q22 GALK1 C Galactokinase-1 230200 S,Ch,R,C Galactokinase deficiency (1) 11(Glk)
17q21-q22 KRT10 C Keratin-10 148080 REa,A,REn Epidermolytic hyperkeratosis,
113800 (3) 
17q21-q22 PENT,PNMT C Phenylethanolamine N-methyltrans- 171190 REa, Fd ?Hypertension, essential, 
ferase 145500 (1) 
17q21-q22 SLC4A1, C Solute carrier family 4, anion exchanger, 109270 REa, RE, Fd, A [Acanthocytosis, one form] (3);
AE1, EPB3 member 1 (erythrocyte membrane [elliptocytosis, Malaysian- 
protein band 3, Diego blood group) Melanesian type] (3); Sphe- 
rocytosis, hereditary (3) 
17q21.3-q22 MPO C Myeloperoxidase 254600 REa,A,E,Ch, Myeloperoxidase deficiency 11(Mpo)
G (3) 
17q21.31- COL1A1 C Collagen I, α-1 polypeptide 120150 C,M,A,REa Osteogenesis imperfecta, 4 11(Cola1)
q22.05 clinical forms, 166200, 
166210, 259420, 166220 (3); 
Ehlers—Danlos syndrome, 
type VIIA1, 130060 (3); 
osteoporosis, idiopathic, 
166710 (3) 
17q21.32 ITGA2B, GP2B, CD41B C Integrin, α 2b (platelet glycoprotein IIb of IIb/IIIa complex, antigen CD41B) 273800 A, REb, REa, RE, F, LD Glanzmann thrombasthenia, type A (3); thrombocytopenia, neonatal alloimmune (1)
17q21.32 ITGB3, GP3A C Integrin, β-3 (platelet glycoprotein IIIa; antigen CD61) 173470 REa, REb, A, RE, F, LD Glanzmann thrombasthenia, type B (3)
17q22-q24 CSH1, CSA, PL C Chorionic somatomammotropin hormone-1 150200 REa, A [Placental lactogen deficiency] (1) 13(P11)
17q22-q24 GH1, GHN C Growth hormone-1 139250 REa, A, Fd Isolated growth hormone deficiency, Illig type with absent GH and Kowarski type with bioinactive GH (3) 11(Gh)
17q23 DCP1, ACE1 C Dipeptidyl carboxypeptidase-1 (angiotensin I converting enzyme) 106180 A, H, Fd {Myocardial infarction, susceptibility to} (3)
17q23 GAA C Glucosidase, acid α- 232300 S, A, D, F Glycogen storage disease,
type II (3) 
17q23-qter APOH C Apolipoprotein H (β-2-glycoprotein I) 138700 Fd, REa [Apolipoprotein H deficiency] (3) 11(Apoh)
17q23-qter TOC, TEC P Tylosis with oesophageal cancer 148500 Fd Tylosis with oesophageal 
cancer (2) 
17q23.1-q25.3 SCN4A, HYPP, NAC1A C Sodium channel, voltage-gated, type 4, α polypeptide 170500 REa, Fd Hyperkalaemic periodic paralysis (3); paramyotonia congenita, 168300 (3); myotonia congenita, atypical acetazolamide-responsive (3)
CCA1 P Cataract, congenital, cerulean type 115660 Fd Cataract, congenital, cerulean
type (2) 
17q24.3-q25.1 CMD1, C Campomelic dysplasia-1 211970 Ch Campomelic dysplasia with 11(Ts, Sox9)
SOX9, SRA1 (sex reversal, autosomal, 1) autosomal sex reversal (3) 
17q25 ACOX P Acyl-Coenzyme A oxidase 264470 A Adrenoleucodystrophy, 
pseudoneonatal (2) 
17q25 RSS L Russell-Silver syndrome 180860 Ch Russell-Silver syndrome (2)
18pter-q11 HPE1 L Holoprosencephaly-1, alobar 236100 Ch ?Holoprosencephaly-1 (2)
18p11.32 MCL L Multiple hereditary cutaneous 150800 Ch ?Leiomyomata, multiple 
leiomyomata hereditary cutaneous (2) 
18p11.2 MC2R C Melanocortin-2 receptor (ACTH 202200 A,Psh Glucocorticoid deficiency, 18(Mc2r)
receptor) due to ACTH unrespon- 
siveness (1) 
18q11-q12 LCFS2 L Lynch cancer family syndrome II 114400 1 ?Lynch cancer family syn- 
drome II (2) 
18q11-q12 NPC C Niemann-Pick disease, type C 257220 Ch,H,Fd,M Niemann-Pick disease, 18(spm)
type C (2) 
18q11.2-q12.1 TTR, PALB C Transthyretin (prealbumin) 176300 REa, A Amyloid neuropathy, 18(Palb)
familial, several allelic 
types (3); [Dystransthy- 
retinaemic hyperthyrox- 
inaemia] (3); amyloidosis, 
senile systemic (3); carpal 
tunnel syndrome, familial 
(3) 
18q21.1-q22 FEO P Familial expansile osteolysis 174810 Fd Familial expansile osteolysis (2) 
18q21.3 BCL2 C B-cell CLL/lymphoma-2 151430 Ch,RE,REn Leukaemia/lymphoma, B-cell, 1(Bcl2)
2 (2) 
18q21.3 FECH, FCE C Ferrochelatase 177000 A, REb Protoporphyria, erythropoietic (3); protoporphyria, erythropoietic, recessive, with liver failure (3)
18q21.3 FVT1 P Follicular lymphoma, variant 136440 RE Lymphoma/leukaemia, B-cell,
translocation 1 variant (1) 
18q22-qter MS1 L Multiple sclerosis 126200 Fd, LD {?Multiple sclerosis, suscepti- 
bility to} (2) 
18q22.1 GTS L Gilles de la Tourette syndrome 137580 Ch ?Tourette syndrome (2) 
18q23 CYB5 C Cytochrome b5 250790 Psh,REa,A Methaemoglobinaemia due to
cytochrome b5 deficiency
(3)
18q23.3 DCC C Deleted in colorectal carcinoma 120470 D,RE Colorectal cancer (3) 18(Dcc)
19p13.3 FUT6 P Fucosyltransferase 6 (α (1,3) 136836 Psh, REn Fucosyltransferase-6
fucosyltransferase) deficiency (3) 
19p13.3 HHC2,FHH2 P Hypocalciuric hypercalcaemia-2 145981 Fd Hypocalciuric hyper- 
calcaemia, type II (2) 
19p13.3 TBXA2R C Thromboxane A2 receptor 188070 Psh,Fd,A,H Bleeding disorder due to 10(Tbxa2r)
defective thromboxane 
A2 receptor (3) 
19p13.3 TCF3, E2A C Transcription factor-3 (E2A immunoglobulin enhancer-binding factors E12/E47) 147141 REa, A Leukaemia, acute lymphoblastic (1)
19p13.3-p13.2 AMH, MIF C Anti-Müllerian hormone 261550 REa, A Persistent Müllerian duct 10(Amh)
syndrome (3)
19p13.3-p13.2 ATHS, ALP P Atherosclerosis susceptibility (lipoprotein associated) 108725 Fd {Atherosclerosis, susceptibility to} (2)
19p13.3-q13.2 C3 C Complement component-3
19p13.3-q13.2 EPOR C Erythropoietin receptor
GCDH Glutaryl-Coenzyme A dehydrogenase 231670 REa, A Glutaricacidaemia, type I (3)
INSR Insulin receptor 147670 REa, A, REb Leprechaunism (3); diabetes mellitus, insulin-resistant, with acanthosis nigricans (3); Rabson-Mendenhall syndrome (3) 8(Insr)
19p13.2-p13.1 LDLR, FHC C Familial hypercholesterolaemia 143890 F, REa,A Hypercholesterolaemia, 9(Ldlr)
(LDL receptor) familial (3)
19p13.2-p13.1 LYL1 C Lymphoblastic leukaemia derived 151440 Ch,A Leukaemia, T-cell acute 8(Lyl1)
sequence-1 lymphoblastoid (2)
19p13.2-q13.3 LPSA, C Oncogene liposarcoma (DNA Segment, 164953 A Liposarcoma (1)
D19S381E single copy, expressed, probes MC15,
MC6)
19p13.1 RFX1 Cc Regulatory factor (trans-acting) 1 600006 A Severe combined immunode- 
(influences HLA class II expression) ficiency, HLA class II- 
negative type, 209920 (2) 
19p13 MHP1 e Migraine, hemiplegic 1 141500 Fd Migraine, hemiplegic-1 (2) 
19p APCA,CAPA P Cerebellar ataxia, paroxysmal 108500 Fd Cerebellar ataxia, paroxysmal 
acetazolamide-responsive acetazolamide-responsive (2)
19p EXT3 P Exostoses (multiple) 3 600209 Fd Exostoses, multiple, type 3
(2) 
19cen-q12 MANB Cc Mannosidase, alpha B, lysosomal 248500 S, Psh Mannosidosis (1) 
19cen-q13.11 PEPD C Peptidase D (prolidase) 170100 S, EH, Fd Prolidase deficiency (3) 7(Pep4)
19cen-q13.2  AD2 Cc Alzheimer disease-2 (late-onset) 104310 Fd Alzheimer disease-2, late 
onset (2) 
19q12 CADASIL, P Cerebral autosomal dominant arteri- 125310 Fd Cerebral arteriopathy with 
CASIL opathy with subcortical infarcts and subcortical infarcts and 
leukoencephalopathy leukoencephalopathy (2) 
19q12 EDM1,MED C Epiphyseal dysplasia, multiple 1 132400 Fd Epiphyseal dysplasia, 
multiple 1 (2) 
19q12 PSACH € Pseudoachondroplastic dysplasia 177170 Fd Pseudoachondroplastic 
dysplasia (2) 
19q12-q13.1 NPHS1, CNF P Nephrosis 1, congenital, Finnish type 256300 Fd Nephrosis, congenital,
NFC Finnish (2) 
19q13 BCL3 Gc B-cell CLL/lymphoma-3 109560 Ch, S/H Leukaemia/lymphoma, 7(Bcl3) 
B-cell, 3 (2) 
19q13.1 GPI C Glucose phosphate isomerase; 172400 S,D,A Haemolytic anaemia due to 7(Gpi1)
neuroleukin glucosephosphate 
isomerase deficiency (3); 
hydrops fetalis, one form 
(1) 
19q13.1 RYR1, C Ryanodine receptor-1 (skeletal) 180901 A, Fd,H Malignant hyperthermia 7(Ryr)
MHS, CCO susceptibility-1, 145600 
(3); central core disease, 
117000 (3) 
19q13.1-q13.2 AKT2 ly Murine thymoma viral (v-akt) 164731 A Ovarian carcinoma, 167000 (2) 
homologue-2 
19q13.1-q13.2 BCKDHA, MSUD1 C Branched chain keto acid dehydrogenase E1, α polypeptide 248600 REa, REb, A Maple syrup urine disease, type Ia (3)
19q13.1-q13.2 CORD2,CRD P Cone rod dystrophy 2 (autosomal 120970 Fd Cone-rod retinal dystrophy (2) 
dominant) 
19q13.2 APOE C Apolipoprotein E 107741 F, REa, LD, Hyperlipoproteinaemia, 7(Apoe)
A, Fd type III (3)
19q13.2 APOC2 C Apolipoprotein C-II 207750 REa, F, LD, Hyperlipoproteinaemia,
A, Fd type Ib (3)
19q13.2-q13.3 DM C Dystrophia myotonica 160900 F, Fd Myotonic dystrophy (3) 7(Dm)
19q13.2-q13.3 ERCC2, EM9 C Excision repair cross complementing 126340 S,RE,M Xeroderma pigmentosum, 7(Ercc2)
rodent repair deficiency, comple- group D, 278730 (3) 
mentation group-2 
19q13.2-q13.3 HB1, PFHB1 P Heart block, progressive familial, type I 113900 Fd Heart block progressive
familial, type I (2) 
19q13.2-q13.3 LIG1 C Ligase I, DNA, ATP-dependent 126391 REa,A DNA ligase I deficiency (3)
19q13.2-q13.3 PVS C Polio virus sensitivity 173850 S, A, REa {Polio, susceptibility to} (2) 9(Pvs)
19q13.3 ETFB C Electron transfer flavoprotein, 130410 REa, A Glutaricaciduria, type IIB (3)
β-polypeptide
19q13.3 GYS1,GYS Cc Glycogen synthase 138570 REa, A {Non-insulin dependent 
diabetes mellitus, suscepti- 
bility to} (2) 
19q13.32 LHB C Luteinizing hormone, β-polypeptide 152780 RE Hypogonadism, hypergonadotropic (3); ?male 7(Lhb)
pseudohermaphroditism 
due to defective LH (1) 
19q13.4 RP11 P Retinitis pigmentosa-11 600138 Fd Retinitis pigmentosa-11 (2)
(autosomal dominant) 
Chr.19 BCT2 P Branched chain aminotransferase-2 113530 S ?Hypervalinaemia or hyperleucine-isoleucinaemia (1)
20pter-p12 PRNP, PRIP C Prion protein (p27-30) 176640 REa, REb, A Creutzfeldt-Jakob disease, 123400 (3); Gerstmann-Straussler disease, 137440 (3); insomnia, fatal familial (3) 2(Prnp)
20p13 AVP, AVRP, VP C Arginine vasopressin (neurophysin II, antidiuretic hormone) 192340 REa, RE, Fd Diabetes insipidus, neurohypophyseal, 125700 (3) 2(Avp)
20p12 BMP2, BMP2A C Bone morphogenetic protein-2 112261 H, REa, A ?Fibrodysplasia ossificans 2(Bmp2a) 
progressiva (1) 
20p12-cen THBD,THRM C Thrombomodulin 188040  REb,A Thrombophilia due to throm- 
bomodulin defect (3) 
20p11.2 AGS, AHD Cc Alagille syndrome 118450 Ch,D Alagille syndrome (2) 
(arteriohepatic dysplasia) 
20p11 CST3 fe Cystatin C 105150 REa, A Cerebral amyloid angiopathy 2(Cst3) 
(3) 
20p ITPA Ee Inosine triphosphatase-A 147520 S [Inosine triphosphatase 2(Itp) 
deficiency] (1) 
20q11.2 GHRF C Growth hormone releasing factor; 139190 REa, REb,Ch, ?Isolated growth hormone
somatocrinin Fd,A deficiency due to defect in 
GHRF (1); gigantism due to 
GHRF hypersecretion (1) 
20q13 MODY1 C Maturity-onset diabetes of the young, 125850 Fd MODY, type I (2)
type I
20q13.1 PPGB, GSL, C Protective protein for β-galactosidase 256540 S, A, Fd Galactosialidosis (3) 2(Ppgb)
NGBE, GLB2
20q13.11 ADA C Adenosine deaminase 102700 S, D, REa, Severe combined immuno- 2(Ada) 
FA, Fd deficiency due to ADA 
deficiency (3); haemolytic
anaemia due to ADA excess
(1) 
20q13.2 GNAS1, C Guanine nucleotide-binding protein 139320 REa,H,A,Fd Pseudohypoparathyroidism, 2(Gnas)
GNAS, GPSA (G protein), α-stimulating activity type Ia, 103580 (3);
McCune—Albright poly- 
ostotic fibrous dysplasia, 
174800 (3); somatotro- 
phinoma (3) 
20q13.2-q13.3 CHRNA4, C Cholinergic receptor, nicotinic, 118504 REa, REn, Epilepsy, benign neonatal, 2(Acra4)
EBN1 α polypeptide-4 A, Fd type I, 121200 (3)
20q13.2-q13.3  FA1,FA,FACA P Fanconi anaemia-1 227650 Fd Fanconi anaemia-1 (2) 
20q13.31 PCK1 C Phosphoenolpyruvate carboxykinase-1 261680 REa, A, Fd ?Hypoglycaemia due to PCK1 2(Pck1)
(soluble) deficiency (1) 
21q11.2 MST I Myeloproliferative syndrome, transient 159595 Ch ?Leukaemia, transient (2) 
21q21.3- APP, AAA, C Amyloid β (A4) precursor protein 104760 REa,A,Fd,RE Amyloidosis, cerebroarterial, 16(App)
q22.05 CVAP Dutch type (3); Alzheimer 
disease, APP-related (3); 
schizophrenia, chronic (3) 
21q22.1 HCS P Holocarboxylase synthetase 253270 Psh, A Multiple carboxylase deficiency, biotin-responsive (3)
21q22.1 SOD1, ALS1 C Superoxide dismutase-1, soluble 147450 S,D, Fd Amyotrophic lateral sclerosis, 16(Sod1)
due to SOD1 deficiency, 
105400 (3) 
21q22.3 APECED ie Autoimmune polyglandular disease, 240300 Autoimmune polyglandular 
type I disease, type I (2)
21q22.3 CBFA2,AML1 C Core-binding factor, runt domain, 151385 Ch, Fd Leukaemia, acute myeloid (3) 
subunit (aml1 oncogene)
21q22.3 CBS C Cystathionine β-synthase 236200 S,D,A, Fd Homocystinuria, B6-responsive and non-responsive types (3) 17(Cbs)
21q22.3 DSCR € Down syndrome (critical region) 190685 Ch Down syndrome @ 
21q22.3 EPM1 ie Epilepsy, progressive myoclonic 1 254800 Fd, LD Epilepsy, progressive 
myoclonus (2) 
21q22.3 ITGB2, CD18, LCAMB, LAD C Integrin, β-2 (antigen CD18 (p95), lymphocyte function-associated antigen-1; macrophage antigen, ...) 116920 S,A, Fd Leukocyte adhesion deficiency (1) 7(Ly15)
21q22.3 PFKL C Phosphofructokinase, liver type 171860 S,D,F Haemolytic anaemia due to phosphofructokinase deficiency (1) 17(Pfk1)
22q11 CECR, CES C Cat eye syndrome 115470 Ch,A,D Cat eye syndrome (2)
22q11 CTHM L Conotruncal cardiac anomalies 217095 D ?Conotruncal cardiac anomalies
Table VII.1 Continued.
22q11 DGCR, Cc DiGeorge syndrome chromosome 188400 Ch, D DiGeorge syndrome (2); Velo- 
DGS, VCF region (velocardiofacial syndrome) cardiofacial syndrome, 
192430 (2) 
22q11 HCF2, HC2 E Heparin cofactor II 142360 REb, REa Thrombophilia due to heparin 
cofactor II deficiency (3) 
22q11 NAGA C Acetylgalactosaminidase, α-N- 104170 S,Ch Schindler disease (3); Kanzaki
(α-galactosidase B) disease (3)
22q11.1-q11.2 GGT1, GTG C γ-glutamyltransferase-1 231950 A,S,E, RE Glutathioninuria (1)
22q11.12 GGT2 P γ-glutamyltransferase-2 137181 REn [Gamma-glutamyltransferase,
familial high serum] (2) 
22q11.2-qter SGLT1 P Sodium-glucose transporter-1 182380 REa Glucose/galactose malabsorption (3)
22q11.2-qter TCN2, TC2 C Transcobalamin II 275350 E,S,D Transcobalamin II deficiency 11(Tcn2)
(3)
22q11.21 BCR, CML, C Breakpoint cluster region 151410 Ch, RE Leukaemia, chronic myeloid 10(Bcr)
PHL (3) 
22q12 EWSCR,EWS C Ewing sarcoma breakpoint region-1 133450 Ch Ewing sarcoma (3); Neuroep- 
ithelioma (2) 
22q12.1-q13.2 TIMP3,SFD C Tissue inhibitor of metalloproteinase-3 188826 REa, A, Fd Sorsby fundus dystrophy, 
136900 (3) 
22q12.2 NF2 C Neurofibromatosis-2 101000 RE, F, Ch, Neurofibromatosis, type 2 11(Nf2)
(bilateral acoustic neuroma) D, Fd (3); meningioma, NF2-
related (3); Schwannoma,
sporadic (3)
22q12.3-q13.1 PDGFB, SIS C Platelet-derived growth factor, 190040 REa, Fd Meningioma, SIS-related (3) 15(Pdgfb)
β polypeptide (oncogene SIS)
22q13-qter ACR P Acrosin 102480 REa ?Male infertility due to 15(Acr) 
acrosin deficiency (2) 
22q13.1 ADSL Cc Adenylosuccinate lyase 103050 S, REa, A Adenylosuccinase deficiency 
(1); autism, succinylpurinae- 
mic (3) 
22q13.1 CYP2D@, CYP2D, P450C2D C Cytochrome P450, subfamily IID 124030 F, Fd, Psh, A {?Parkinsonism, susceptibility to} (1); Debrisoquine sensitivity (3) 15(Cyp2d)
22q13.1-qter SFD P Sorsby fundus dystrophy 136900 Fd Sorsby fundus dystrophy (2)
22q13.31-qter ARSA C Arylsulphatase A 250100 S,D Metachromatic leukodystrophy (3) 15(As2)
22q13.31-qter DIA1 Cc Diaphorase (NADH); cytochrome b- 250800 S, REa Methaemoglobinaemia, 15(Dia1) 
5 reductase enzymopathic (3) 
Xpter-p22.32 GCFX, SS P Growth control factor, X-linked 312865 Fd Short stature (2)
Xpter-p22.2 CFND L Craniofrontonasal dysplasia 304110 Ch ?Craniofrontonasal dysplasia 
(2) 
Xp22.32 CSF2RA C Colony-stimulating factor-2 receptor, α, low-affinity (granulocyte-macrophage) 306250 A Leukaemia, acute myeloid, M2 type (1) 19(Csf2ra)
Xp22.32 STS, ARSC1, C Steroid sulphatase, microsomal 308100 F, S, D Ichthyosis, X-linked (3); X,Y (Sts)
SSDD placental steroid sulphatase
deficiency (3) 
Xp22.31 DHOF, FODH P Dermal hypoplasia, focal 305600 Ch Focal dermal hypoplasia (2) 
Xp22.3 ARSE, CDPX1, C Arylsulphatase E 302950 D, Fd Chondrodysplasia punctata, 
CDPXR X-linked recessive (3) 
Xp22.3 KAL1, KMS, C Kallmann syndrome-1 sequence 308700 F, Fd, D, Kallmann syndrome (3)
ADMLX REa, REn
Xp22.3 OA1 C Ocular albinism-1, Nettleship-Falls 300500 F, Fd Ocular albinism, Nettleship-
type Falls type (3) 
Xp22.3 OASD P Ocular albinism and sensorineural 300650 Fd Ocular albinism with 
deafness sensorineural deafness (2) 
Xp22.3-p22.1 AMELX, Cc Amelogenin 301200 REa, A, Fd Amelogenesis imperfecta (3) X-(Ame1) 
AMG, AIH1, 
AMGX 
Xp22.3-p21.1 NHS C Nance-Horan cataract-dental 302350 Fd Nance-Horan syndrome (2) ?X(Xcat)
syndrome
Xp22.3-p21.1 POLA C Polymerase (DNA directed), α 312040 S ?N syndrome, 310465 (1) X(Pola)
Xp22.3-p22.1 RS C Retinoschisis 312700 F, Fd Retinoschisis (2)
Xp22.2 CMTX2 P Charcot—Marie—Tooth disease, 302801 Fd Charcot—Marie—Tooth neuro- 
pathy, X-linked-2, reces- 
sive (2) 
Xp22.2 FCPX, FCP P F-cell production 305435 F, Fd Heterocellular hereditary persistence of fetal haemoglobin, Swiss type (2)
Xp22.2 HOMG, R Hypomagnesaemia, secondary 307600 X/A Hypomagnesaemia, X-linked 
Moke hypocalcaemia primary (2) 
Xp22.2 MLS, MITF P Microphthalmia with linear skin defects 309801 Ch ?Microphthalmia with linear 
(microphthalmia-associated tran- skin defects (2) 
scription factor) 
Xp22.2-p21.2 KFSD P Keratosis follicularis spinulosa 308800 Fd Keratosis follicularis spinulosa
decalvans decalvans (2) 
Xp22.2-p22.1 CLS P Coffin-Lowry syndrome 303600 Fd Coffin—Lowry syndrome (2) 
Xp22.2-p22.1 HYP,HPDR1 C Hypophosphataemia, vitamin D 307800 Fd Hypophosphataemia, X(Hyp) 
resistant rickets hereditary (2) 
Xp22.2-p22.1 PDHA1, C Pyruvate dehydrogenase, E1-α, 312170 REa, A Pyruvate dehydrogenase X(Pdha1)
PHE1A polypeptide-1 deficiency (3) 
Xp22.2-p22.1 PHK,PHKA2 C Phosphorylase kinase deficiency, liver 306000 Fd, REa, A Glycogen storage disease, 
(glycogen storage disease type VIII) X-linked hepatic (2) 
Xp22.2-p22.1 PRTS,MRXS1 P Partington syndrome (mental retar- 309510 Fd Mental retardation, X-linked, 
dation, X-linked, syndromic-1, with syndromic-1, with dystonic 
dystonic movements, ataxia, and movements, ataxia, and 
seizures) seizures (2) 
Xp22.2-p22.1 SEDL,SEDT C Spondyloepiphyseal dysplasia, late 313400 Fd Spondyloepiphyseal dysplasia 
tarda (2) 
Xp22.11-p21.2 GDXY P Gonadal dysgenesis, XY female type 306100 F, Ch Gonadal dysgenesis, XY female
Xp22 AGMX2, P Agammaglobulinaemia, X-linked 2 300310 Fd Agammaglobulinaemia, X(Xid)
XLA2, IMD6 (with growth hormone deficiency) type 2, X-linked (2)
Xp22 AIC C Aicardi syndrome 304050 X/A,Ch Aicardi syndrome (2)
Xp22 GY, HYP2 L Hereditary hypophosphataemia II 307810 H ?Hypophosphataemia with  X(Gy) 
(gyro equivalent) deafness (2) 
Xp22 MRX1 C Mental retardation, X-linked-1, non-dysmorphic 309530 F, Fd, D Mental retardation, X-linked-1, non-dysmorphic (2)
Xp22-p21 PDR P Pigment disorder, reticulate 301220 Fd Partington syndrome II (2) 
Xp21.3-p21.2 DAX1, C DSS, AHC, X gene 1 300200 D, Fd Adrenal hypoplasia, congenital,
AHC, AHX with hypogonadotrophic 
hypogonadism (3) 
Xp21.3-p21.2 GK C Glycerol kinase deficiency 307030 D, Fd Glycerol kinase deficiency (2)
Xp21.3-p21.2 RP6 L Retinitis pigmentosa-6 (X-linked 312612 Fd ?Retinitis pigmentosa-6 (2)
recessive)
Xp21.2 DFN4 P Deafness 4, congenital sensorineural 600203 Fd Deafness 4, congenital sensorineural (2)
Xp21.2 DMD,BMD C Dystrophin (muscular dystrophy, 310200 X/A,Fd,D Duchenne muscular dystro- X(Dmd) 
Duchenne and Becker types) phy (3); Becker muscular 
dystrophy (3); Cardiomy- 
opathy, dilated, X-linked (3) 
Xp21.2-p21.1 XK C Kell blood group precursor 314850 F, D McLeod phenotype (3)
Xp21.1 CYBB, CGD C Cytochrome b-245, β-polypeptide 306400 F, D Chronic granulomatous X(Cybb)
disease, X-linked (3)
Xp21.1 OTC C Ornithine transcarbamylase 311250 L,REa,A,D Ornithine transcarbamylase X(spf; Otc)
deficiency (3) 
Xp21.1 RP3 Cc Retinitis pigmentosa-3 (X-linked 312610 Fd,D Retinitis pigmentosa-3 (2) 
recessive)
Xp21.1-q22 WTS, MRXS6 P Wilson-Turner syndrome (mental retardation, X-linked, syndromic-6, with gynaecomastia and obesity) 309585 Fd Mental retardation, X-linked, syndromic-6, with gynaecomastia and obesity (2)
Xp21 GTD L Gonadotropin deficiency 306190 D ?Gonadotropin deficiency (2);
?cryptorchidism (2)
Xp21 SRS, MRSR P Snyder-Robinson X-linked mental retardation syndrome 309583 Fd Mental retardation, Snyder-Robinson type (2)
Xp11.4 NDP, ND C Norrie disease (pseudoglioma) 310600 Fd, D Norrie disease (3); Exudative
vitreoretinopathy, X-linked, 
305390 (3) 
Xp11.4-p11.23 AIED, OA2 C Aland island eye disease (ocular albinism, Forsius-Eriksson type) 300600 F, D, Fd Ocular albinism, Forsius-Eriksson type (2)
Xp11.4-p11.23 PFC, PFD C Properdin P factor, complement 312060 Fd, REa, A Properdin deficiency, X-linked X(Pfc)
Xp11.3 COD1, PCDX P Cone dystrophy-1 (X-linked) 304020 Fd Progressive cone dystrophy (2)
Xp11.3 CSNB1 I Congenital stationary night blindness-1 310500 Fd Congenital stationary night blindness-1
Xp11.3 RP2 C Retinitis pigmentosa-2 (X-linked recessive) 312600 Fd Retinitis pigmentosa-2 (2)
Xp11.23 MAOA C Monoamine oxidase A 309850 Fd, REa, D, Brunner syndrome (3) X(Maoa)
A, REn 

Xp11.23- NPHL2, 1? Nephrolithiasis 2, X-linked 600248 Fd Nephrolithiasis 2, X-linked (2) 

p11.22 DENTS (Dent syndrome)
Xp11.23- WAS, Cc Wiskott—Aldrich syndrome 301000 Fd,X/A Wiskott—Aldrich syndrome (3); 

p11.22 IMD2, THC Thrombocytopenia, X-linked, 

313900 (3) 

Xp11.22 CLCK2 P Chloride channel, voltage-gated, K2 600260 RE ?Dent disease, 310468 (2)

Xp11.22 NPHL1,XRN P Nephrolithiasis 1 (X-linked) 310468 Fd Nephrolithiasis, X-linked, 
with renal failure (2) 

Xp11.21 ALAS2, C Aminolaevulinate, δ-, synthase-2 301300 Ch, REa, Anaemia, sideroblastic/

ASB, ANH1 A, Fd hypochromic (3) 

Xp11.21 FGDY, AAS C Faciogenital dysplasia 305400 X/A, Fd Aarskog-Scott syndrome (3)
(Aarskog-Scott syndrome)

Xp11.21 IP1, IP C Incontinentia pigmenti-1, sporadic type 308300 X/A Incontinentia pigmenti, X(Td)

sporadic type (2) 

Xp11.2 RCCP2 P Renal cell carcinoma, papillary 312390 Ch,S Renal cell carcinoma, papil- 
lary, 2 (2) 

Xp11.2 SSRC, SSX C Sarcoma, synovial 312820 Ch, RE,A Sarcoma, synovial (3)

Xp11 MRXA L Mental retardation, X-linked non- 309545 Fd ?Mental retardation, X-linked
specific, with aphasia non-specific, with aphasia 
(2) 

Xp11-q21 PRS, MRXS2 P Prieto syndrome (mental retardation, 309610 Fd Mental retardation, X-linked,
X-linked, syndromic-2, with dys- syndromic-2, with dysmor- 
morphism and cerebral atrophy) phism and cerebral atrophy 

(2) 

Xp11-q21.3 SHS, MRXS3 P Sutherland-Haan syndrome (mental 309470 Fd Mental retardation, X-linked,
retardation, X-linked, syndromic-3, syndromic-3, with spastic 
with spastic diplegia) diplegia (2) 

Xp Cer L Cataracts, congenital total 302200 Fd ?Cataract, congenital total (2) 

Xp RTT, RTS L Rett syndrome 312750 Ch ?Rett syndrome (2) 

Xp SMAX2 P Spinal muscular atrophy, X-linked 600199 Fd Spinal muscular atrophy X- 
lethal infantile linked lethal infantile (2) 

Xq11-q12 AR, DHTR, TFM, SBMA, KD C Androgen receptor (dihydrotestosterone receptor) 313700 S, Fd, REa, A Androgen insensitivity, several forms (3); spinal and bulbar muscular atrophy of Kennedy, 313200 (3); prostate cancer (3); perineal hypospadias (3); breast cancer, male, with Reifenstein syndrome (3) X(Tfm)

Xql11-q12 MRX2 js Mental retardation, X-linked-2, 309540 Fd ?Mental retardation, X-linked- 
non-dysmorphic 2, non-dysmorphic (2) 

Xq12-q13 ATP7A, C ATPase, Cu transporting, 309400 Fd, X/A,H Menkes’ disease (2) X(Mnk)

MNK, MK α-polypeptide

Xq12-q13.1 DYT3 c Torsion dystonia-Parkinsonism, 314250 Fd Torsion dystonia-Parkinson- 
Filipino type ism, Filipino type (2) 

Xq12-q21 JMS 13 Juberg—Marsidi syndrome 309590 Fd Juberg—Marsidi syndrome (2) 

Xq12.2-q13.1 EDA,HED a Anhidrotic ectodermal dysplasia 305100 X/A,H, Fd Anhidrotic ectodermal X(Ta) 
dysplasia (2) 

Xq13 ASAT (Tb Anaemia, sideroblastic, with 301310 Fd ?Anaemia, sideroblastic, with 

spinocerebellar ataxia spinocerebellar ataxia (2) 
Xq13 IL2RG, SCIDX1, SCIDX, IMD4 C Interleukin-2 receptor, γ 308380 Fd Severe combined immunodeficiency X-linked, 300400 (3); combined immunodeficiency, X-linked, moderate, 312863 (3) X(Il2rg)

Xq13 PGK1,PGKA C Phosphoglycerate kinase-1 311800 S,R,REb,Fd Haemolytic anaemia due to X(Pgk1) 
PGK deficiency (3); myo- 
globinuria/haemolysis due 
to PGK deficiency (3) 

Xq13 PHKA1 C Phosphorylase kinase, muscle, α-polypeptide 311870 REa, A, REn Muscle glycogenosis (3) X(Phka)

Xq13 RAD54, XH2, ATRX, ATR2 C RAD54 (Saccharomyces cerevisiae) 600254 RE, Fd α-thalassaemia/mental retardation syndrome, type 2, 301040 (3) X(Xh2)
Xq13-q21 WWS P Wieacker-Wolff syndrome 314580 Fd Wieacker-Wolff syndrome (2)
Xq13-q22 MCS, MRXS4 P Miles-Carpenter syndrome (mental retardation, X-linked, syndromic-4, with congenital contractures and low fingertip arches) 309605 Fd Mental retardation, X-linked, syndromic-4, with congenital contractures and low fingertip arches (2)
Xq13.1 GJB1, CX32, CMTX1 C Gap junction protein, β-1, 32 kD (connexin 32) 304040 REa, Fd Charcot-Marie-Tooth neuropathy, X-linked-1, dominant, 302800 (3) X(Gjb1)
Xq13.1 RPS4X, CCG2, C Ribosomal protein S4, X-linked 312760 A,REa,REn Turner syndrome (1) X(Rps4x)
Xq21 AHDS P Allan-Herndon-Dudley mental 309600 Fd Allan-Herndon syndrome
retardation syndrome (2) 
Xq21.1 POU3F4, P POU domain, class 3, transcription 600420 Fd,D,H,REn Deafness, conductive, with 
DFN3 factor 4 stapes fixation, 304400 (3) 
Xq21.1-q21.31 CPX C Cleft palate and/or ankyloglossia 303400 Fd,D Cleft palate, X-linked (2)
Xq21.2 CHM, TCD C Choroideraemia 303100 Fd,LD,D,A, Choroideraemia (3)
Ch, X/A 
Xq21.3-q22 BTK, AGMX1, C Bruton agammaglobulinaemia tyrosine 300300 H,Fd,A Agammaglobulinaemia, X(xid, Btp) 
IMD1, XLA, kinase type 1, X-linked (3); ?XLA 
AT and isolated growth hor- 
mone deficiency, 
307200 (3) 
Xq21.3-q22 MGC1,MGCN P Megalocornea-1, X-linked 309300 Fd Megalocornea, X-linked (2) 
Xq21.3-q22 PHP,GHDX  L Panhypopituitarism, X-linked 312000 Fd ?Panhypopituitarism, X- 
linked (2) 
Xq22 COL4A5, G Collagen, type IV, alpha-5 polypeptide 303630 REa, A, Fd Alport syndrome, 301050 (3); 
ATS, ASLN Leiomyomatosis-nephro- 
pathy syndrome, 308940 (1) 
Xq22 COL4A6 C Collagen, type IV, α-6 polypeptide 303631 REn, A Leiomyomatosis, diffuse (1);
?Alport syndrome, X-linked, 
type 2 (1) 
Xq22 GLA C Galactosidase, α 301500 S,R,A, Fd Fabry disease (3) X(Ags)
Xq22 PLP, PMD C Proteolipid protein; Pelizaeus- 312080 REa, A, Ch, Pelizaeus-Merzbacher X(Plp(jp))
Merzbacher disease R, Fd disease (3); Spastic para- 
plegia 2, 312920 (3) 
Xq22 TBG Cc Thyroxine-binding globulin 314200 REa, A [Euthyroidal hyper- and 
hypothyroxinaemia] (1) 
Xq22-q24 PRPS1 C Phosphoribosyl pyrophosphate synthetase-1 311850 S, R, REa, A Phosphoribosyl pyrophosphate synthetase-related gout (3)
Xq22-q28 AIH3 L Amelogenesis imperfecta-3, hypomaturation or hypoplastic type 301201 Fd ?Amelogenesis imperfecta-3, hypoplastic type (2)
Xq22.1 PIGA P Phosphatidylinositol glycan class A 311770 A Paroxysmal nocturnal haemoglobinuria (3) X(Piga)
Xq25 LYP, IMD5, XLP, XLPD C Lymphoproliferative syndrome 308240 Fd, D Lymphoproliferative syndrome, X-linked (2)
Xq25-q26 HTX1 P Heterotaxy-1 306955 Fd Heterotaxy, X-linked visceral 
(2) 
Xq25-q26.1 TAS P Thoracoabdominal syndrome 313850 Fd Thoracoabdominal syndrome
(2)
Xq25-q27 PGS, MRXS5 P Pettigrew syndrome (mental retardation, X-linked, with Dandy-Walker malformation, basal ganglia disease, and seizures) 304340 Fd Mental retardation, X-linked, syndromic-5, with Dandy-Walker malformation, basal ganglia disease, and seizures (2)
Xq26 CD40LG, C CD40 antigen ligand (hyper-IgM 308230 Fd, A, Psh Immunodeficiency, X-linked, X(CD40l)
HIGM1 IGM syndrome) with hyper-IgM (3) 
Xq26 GUST R Gustavson mental retardation syn- 309555 Fd Gustavson syndrome (2) 
drome (with microcephaly, optic 
atrophy, deafness)
Xq26 SDYS, SGB C Simpson dysmorphia syndrome 312870 Fd Simpson-Golabi-Behmel
syndrome (2) 
Xq26 SHFM2, P Split hand/foot malformation, type 313350 Fd Split hand/foot malformation,
SHFD2 (ectrodactyly) 2 type 2 (2) 
Xq26-q27 BFLS 1? Borjeson—Forssman-Lehmann syn- 301900 Fd Borjeson—Forssman—Lehmann 
drome syndrome (2) 
Xq26-q27 HPT, HPTX, P Hypoparathyroidism 307700 Fd Hypoparathyroidism, X-
HYPX linked (2)
x F1, POF iE Premature ovarian failure-1 311360 Ch Ovarian failure, premature (2) ' 
Xq26-q27.2 HPRT C Hypoxanthine phosphoribosyl- 308000 S,M,C,R, Lesch-Nyhan syndrome (3); X(Hprt)
transferase REa, Fd HPRT-related gout (3)
Xq26-qter INDX P Immunoneurologic syndrome, X- 600486 Fd Wood's neuroimmunologic 
linked, of Wood, Black, and Norbury syndrome (2) 
Xq26.1 OCRL, C Oculocerebrorenal syndrome of Lowe 309000 X/A, Fd Lowe syndrome (3)
LOCR, 
Location Locus symbol Status Title MIM# Method Disorder(s) Mouse locus
Xq26.3-q27.1 ADFN,ALDS P Albinism-deafness syndrome 300700 Fd Albinism-deafness syn- 
drome (2) 
Xq27-q28 ANOP1 1h Anophthalmos-1 (with mental 301590 F ?Anophthalmos-1 (2) 
retardation but without anomalies) 
Xq27.1-q27.2 F9, HEMB C Coagulation factor IX (plasma thrombo- 306900 REa, A, Fd, Haemophilia B (3) X(Cf9)
plastic component) D, X/A, RE
Xq27.3 FMR1, FRAXA C Fragile X mental retardation-1 309550 Ch, E, Fd, RE Fragile X syndrome (3) X(Fmr1)
Xq28 ALD C Adrenoleucodystrophy 300100 F, Fd, D Adrenoleucodystrophy (3); X(Ald)
adrenomyeloneuropathy (2) 
Xq28 AVPR2, DIR, C Arginine vasopressin receptor-2 304800 Fd, S, REa, Diabetes insipidus, nephro-
DI1, ADHR (nephrogenic diabetes insipidus) Psh genic (3)
Xq28 CBBM,BCM C Blue-monochromatic colour-blindness 303700 F, Fd, RE Colour-blindness, blue 
(blue cone monochromacy) monochromatic (3) 
Xq28 CDPX2, L Chondrodysplasia punctata-2, X-linked 302960 H Chondrodysplasia punctata, X(Bpa) 
CPXD, CPX dominant (Happle syndrome) X-linked dominant (2) 
Xq28 DKC Cc Dyskeratosis congenita 305000 Fd Dyskeratosis congenita (2) 
Xq28 EFE2, BTHS C Endocardial fibroelastosis-2 (Barth 302060 Fd Endocardial fibroelastosis-2
syndrome; cardioskeletal myopathy (2); Barth syndrome (2) 
with neutropenia and abnormal 
mitochondria) 
Xq28 EMD, EDMD C Emery-Dreifuss muscular dystrophy 310300 E, Fd, H, REn Emery-Dreifuss muscular
dystrophy (3) 
Xq28 F8C, HEMA C Coagulation factor VIIIc, procoagulant 306700 F, Fd, REa, A, Haemophilia A (3) X(Cf8)
component RE 
Xq28 FRAXE, FMR2 P Fragile site, X-linked, E 309548 Ch, REn Mental retardation, X-linked, 
FRAXE type (3) 
Xq28 FRAXF P Fragile site, folic acid type, rare, fra(X) 600226 Ch, RE Mental retardation, X-linked, 
(q28) FRAXF type (3) 
Xq28 G6PD,G6PD1 C Glucose-6-phosphate dehydrogenase 305900 E,S,REb,RE G6PD deficiency (3); Favism X(G6pd) 
(3); haemolytic anaemia due 
to G6PD deficiency (3) 
Xq28 HMS1, GAY1 L Homosexuality, male 306995 Fd [?Homosexuality, male] (2)
Xq28 GCP, CBD C Green cone pigment 303800 F, RE, A, Fd Colour-blindness, deutan (3) X(Rsvp)
Xq28 IDS, MPS2, C Iduronate 2-sulphatase (Hunter 309900 X/A, Fd, F, Mucopolysaccharidosis II (3) X(Ids)
SIDS syndrome) RE 
Xq28 IP2 C Incontinentia pigmenti-2 308310 Fd Incontinentia pigmenti, X(?Str)
(familial, male-lethal type) familial (2) 
Xq28 L1CAM, C L1 cell adhesion molecule 308840 A, RE, H, Fd Hydrocephalus due to aque- X(L1cam)
CAML1, ductal stenosis, 307000 (3);
HSAS1 MASA syndrome, 303350 
(3); spastic paraplegia, 
312900 (3) 
Xq28 MAFD2,MDX L Major affective disorder-2 309200 F ?Manic-depressive illness, 
X-linked (2) 
Xq28 MRSD, CHRS P Mental retardation-skeletal dysplasia 309620 Fd Mental retardation-skeletal
dysplasia (2) 
Xq28 MRX3 P Mental retardation, X-linked-3 309541 Fd Mental retardation, X-linked- 
3 (2) 
Xq28 MTM1,MTMX C Myotubular myopathy-1 310400 Fd Myotubular myopathy, X- 
linked (2) 
Xq28 MYP1, BED P Myopia-1 (Bornholm eye disease) 310460 Fd Myopia-1 (2); Bornholm eye 
disease (2) 
Xq28 OPD1 P Otopalatodigital syndrome, type I 311300 Fd Otopalatodigital syndrome,
type I (2) 
Xq28 RCP, CBP C Red cone pigment 303900 F, RE, A, Fd Colour-blindness, protan (3) X(Rsvp)
Xq28 TKC,TKCR  C Torticollis, keloids, cryptorchidism 314300 X/A Goeminne TKCR syndrome 
and renal dysplasia (2) 
Xq28 WSN, BGMR P Waisman syndrome (basal ganglion 311510 Fd Waisman parkinsonism-
disorder with mental retardation) mental retardation syn-
drome (2)
Yp11.3 TDF, SRY C Testis determining factor 480000 Ch, Fd Gonadal dysgenesis, Yp(Tdy, Sry)
(sex-determining region Y) XY type (3)
Table VII.2 The morbid anatomy of the human genome (alphabetically by disorder).




Disorder Location 
Aarskog-Scott syndrome (3) Xp11.21 
Abetalipoproteinaemia (3) 2p24 
[Acanthocytosis, one form] (3) 17q21-q22 
Acatalasaemia (3) 11p13 
Acetyl-CoA carboxylase deficiency (1) 17q21 


Achondrogenesis-hypochondrogenesis, type II (3) 
Achondroplasia, 100800 (3) 

?Acrocallosal syndrome (2) 

?Acrofacial dysostosis, Nager type (2) 

ACTH deficiency (1) 

Acyl-CoA dehydrogenase, long chain, deficiency of (3) 
Acyl-CoA dehydrogenase, medium chain, deficiency of (3) 
Acyl-CoA dehydrogenase, short chain, deficiency of (3) 
Adenylosuccinase deficiency (1) 

Adrenal hyperplasia, congenital, due to 11-B-hydroxylase deficiency (3) 
Adrenal hyperplasia, congenital, due to 17-a-hydroxylase deficiency (3) 
Adrenal hyperplasia, congenital, due to 21-hydroxylase deficiency (3) 
Adrenal hypoplasia, congenital, with hypogonadotropic hypogonadism (3) 
Adrenocortical carcinoma, hereditary, 202300 (2) 
Adrenoleucodystrophy (3) 

Adrenoleucodystrophy, pseudoneonatal (2) 
Adrenomyeloneuropathy (2) 

[AFP deficiency, congenital] (1) 

Agammaglobulinaemia, type 1, X-linked (3) 
Agammaglobulinaemia, type 2, X-linked (3) 

Aicardi syndrome (2) 

Alagille syndrome (2) 

Albinism, brown, 203290 (1) 

Albinism, ocular, autosomal recessive (3) 

Albinism, oculocutaneous, type IA (3) 

Albinism, oculocutaneous, type II (3) 

Albinism-deafness syndrome (2) 

?Albright hereditary osteodystrophy-2 (2) 

Alcohol intolerance, acute (3) 

Aldolase A deficiency (3) 

Aldosteronism, glucocorticoid-remediable (3) 

Alkaptonuria (2) 

Allan—Herndon syndrome (2) _ 

Alpha-1-antichymotrypsin deficiency (3) 
Alpha-ketoglutarate dehydrogenase deficiency (1) 
Alpha-thalassaemia/mental retardation syndrome, type 1 (1) 
Alpha-thalassaemia/ mental retardation syndrome, type 2, 301040 (3) 
Alport syndrome, 301050 (3) 

Alport syndrome, autosomal recessive, 203780 (3)

Alport syndrome, autosomal recessive, 203780 (3) 

?Alport syndrome, X-linked, type 2 (1) 

Alzheimer disease, APP-related (3) 

Alzheimer disease-2, late onset (2) 

Alzheimer disease-3 (2) 

Amelogenesis imperfecta (3) 

?Amelogenesis imperfecta-3, hypoplastic type (2) 

[AMP deaminase deficiency, erythrocytic] (3) 

Amyloid neuropathy, familial, several allelic types (3) 
Amyloidosis, cerebroarterial, Dutch type (3) 

Amyloidosis, Finnish type, 105120 (3) 

Amyloidosis, hereditary renal, 105200 (3) 

Amyloidosis, Iowa type, 107680.0010 (3)


12q13.11-q13.2 
4p16.3 
12p13.3-p11.2 
9q32 

2p25 
2q34-q35 
1p31 
12q22-qter 
22q13.1 

8q21 

10q24.3 
6p21.3 
Xp21.3-p21.2 
11p15.5 

Xq28 

17q25 

Xq28 
4q11-q13 
Xq21.3-q22 
Xp22 

Xp22 
20p11.2 

9p23 
15q11.2-q12 
11q14-q21 
15q11.2-q12 
Xq26.3-q27.1 
15q11-q13 
12q24.2 
16q22-q24 
8q21 

3q2 

Xq21 

14q32.1 
7p13-p11.2 
16pter-p13.3 
Xq13

Xq22 

2q36 
2q36-q37 
Xq22 
21q21.3-q22.05 
19cen-q13.2 
14q24.3
Xp22.3-p22.1 
Xq22-q28 
11pter-p13
18q11.2-q12.1 
21q21.3-q22.05 
9q34 

4q28 

11q23 
Amyloidosis, renal, 105200 (3) Chr.12 
{?Amyloidosis, secondary, susceptibility to} (1) 1q21-q23 
Amyloidosis, senile systemic (3) 18q11.2-q12.1 
Amyotrophic lateral sclerosis, juvenile (2) 2q33-q35 
Amyotrophic lateral sclerosis, due to SOD1 deficiency, 105400 (3) 21q22.1
?Anal canal carcinoma (2) 11q22-qter 
Analbuminaemia (3) 4q11-q13 
Androgen insensitivity, several forms (3) Xq11-q12
?Anaemia, megaloblastic, due to DHFR deficiency (1) 5q11.2-q13.2 
Anaemia, pernicious, congenital, due to deficiency of intrinsic factor (1) Chr.11 
?Anaemia, sideroblastic, with spinocerebellar ataxia (2) Xq13
Anaemia, sideroblastic/hypochromic (3) Xp11.21 
Aneurysm, familial, 100070 (3) 2q31 
Angelman syndrome (2) 15q11-q13 
Angio-oedema, hereditary (3) 11q11-q13.1 
Anhidrotic ectodermal dysplasia (2) Xq12.2-q13.1
Aniridia (3) 11p13 
Ankylosing spondylitis (2) 6p21.3 
?Anophthalmos-1 (2) Xq27-q28 
Anterior segment mesenchymal dysgenesis (2) 4q28-q31 
Antithrombin III deficiency (3) 1q23-q25 
Apert syndrome, 101200 (3) 10q26 
Apnoea, postanaesthetic (3) 3q26.1-q26.2 
ApoA-I and apoC-III deficiency, combined (3) 11q23
Apolipoprotein B-100, ligand-defective (3) 2p24 
[Apolipoprotein H deficiency] (3) 17q23-qter 
Argininaemia (3) 6q23 
Argininosuccinicaciduria (3) 7cen-q11.2 
Arrhythmogenic right ventricular dysplasia (2) 14q23-q24 
Aspartylglucosaminuria (3) 4q32-q33 
Ataxia with isolated vitamin E deficiency, 277460 (3) 8q 
Ataxia-telangiectasia (2) 11q22.3 
{Atherosclerosis, susceptibility to} (2) 19p13.3-p13.2 
{Atherosclerosis, susceptibility to} (2) 1q23-q25 
?{Atherosclerosis, susceptibility to} (3) 8p21-p12 
Atopy (2) 11q12-q13 
Atransferrinaemia (1) 3q21 

Atrial septal defect, secundum type (2) 6p21.3 
Autism, succinylpurinaemic (3) 22q13.1 
Autoimmune polyglandular disease, type I (2) 21q22.3 
Bardet-Biedl syndrome 1 (2) 11q13
Bardet-Biedl syndrome 2 (2) 16q21
Bardet-Biedl syndrome 3 (2) 3p13-p12
Bardet-Biedl syndrome 4 (2) 15q22.3-q23
Bare lymphocyte syndrome, type I, due to TAP2 deficiency (1) 6p21.3 
Barth syndrome (2) Xq28 

?Basal cell carcinoma (2) 9q31 

Basal cell carcinoma (3) 5q13.3 

Basal cell naevus syndrome (2) 9q31 

Batten disease (2) 16q12 
Becker muscular dystrophy (3) Xp21.2 
Beckwith-Wiedemann syndrome (2) 11pter-p15.4
Bernard—Soulier syndrome (1) 17pter-p12 
{Beryllium disease, chronic, susceptibility to} (3) 6p21.3 
3-β-hydroxysteroid dehydrogenase, type II, deficiency (3) 1p13.1
Biotinidase deficiency (1) 3p25 


Bladder cancer, 109800 (3) 


13q14.1-q14.2 
Bleeding disorder due to defective thromboxane A2 receptor (3) 19p13.3 
Blepharophimosis, epicanthus inversus and ptosis (2) 3q22-q23 
Bloom syndrome (2) 15q26.1 
Borjeson—Forssman—Lehmann syndrome (2) Xq26-q27 
Bornholm eye disease (2) Xq28 
?Brachydactyly type E (2) 2q37
Brachydactyly-mental retardation syndrome (2) 2q37 
Branchio-otorenal dysplasia (2) 8q13.3 
?Breast cancer (1) 17p13.3 
Breast cancer (1) 6q25.1 
Breast cancer 2, early onset (2) 13q12-q13 
Breast cancer, ductal (2) 1p36 
Breast cancer, ductal (2) Chr.13 
Breast cancer, male, with Reifenstein syndrome (3) Xq11-q12
Breast cancer-1, early onset (3) 17q21 
Breast cancer-3 (2) 11q23 
Brody myopathy (1) Chr.16 
Brunner syndrome (3) Xp11.23 
Burkitt lymphoma (3) 8q24.12-q24.13 
Butterfly dystrophy, retinal (3) 6p21.1-cen 
?C1q deficiency (1) 1p36.3-p34.1 
?C1q deficiency (1) 1p36.3-p34.1 
C1r/C1s deficiency, combined (1) 12p13
C1r/C1s deficiency, combined (1) 12p13
C2 deficiency (3) 6p21.3 
C3 deficiency (3) 19p13.3-p13.2 
C3b inactivator deficiency (1) 4q25 
C4 deficiency (3) 6p21.3 
C4 deficiency (3) 6p21.3 
C5 deficiency (1) 9q34.1 
C6 deficiency (1) 5p13 
C7 deficiency (1) 5p13 
C8 deficiency, type I (2) 1p32 
C8 deficiency, type II (3) 1p32 
C9 deficiency (1) 5p13 
Campomelic dysplasia with autosomal sex reversal (3) 17q24.3-q25.1 
Canavan disease (3) 17pter-p13 
Carbamoylphosphate synthetase I deficiency (3) 2q33-q36 


Carbohydrate-deficient glycoprotein syndrome (2) 
Carboxypeptidase B deficiency (1) 
?Cardiomyopathy (1) 

Cardiomyopathy, dilated, X-linked (3) 
Cardiomyopathy, familial dilated, with conduction defect (2) 
Cardiomyopathy, familial hypertrophic, 1, 192600 (3) 
Cardiomyopathy, familial hypertrophic, 2, 115195 (3) 
Cardiomyopathy, familial hypertrophic, 3, 115196 (3) 
Cardiomyopathy, familial hypertrophic, 4 (2) 
?Carnitine acetyltransferase deficiency (1) 
?Carnitine palmitoyltransferase I deficiency (2) 
Carnitine palmitoyltransferase II deficiency (3) 
Carpal tunnel syndrome, familial (3) 

Cartilage-hair hypoplasia (2) 

Cat eye syndrome (2) 

?Cataract, anterior polar, I (2) 

?Cataract, congenital (2) 

Cataract, congenital, cerulean type (2) 

?Cataract, congenital total (2) 


16p13.3-p13.2 
Chr.13 

2q35 

Xp21.2 
1p11-q11
14q12 

1q3 

15q22 
11p13-q13 
9q34.1 

11q

1p32 
18q11.2-q12.1 
9p13-q11 
22q11 
14q24-qter 
15q15 

CCA1 

Xp 
Cataract, congenital, with late-onset corneal dystrophy (3) 11p13 
Cataract, congenital, with microphthalmia (2) 16p13.3 
Cataract, Coppock-like (3) 2q33-q35 
Cataract, Marner type (2) 16q22.1 
Cataract, zonular pulverulent-1 (2) 1q2 
Cavernous angiomatous malformations (2) 7q11-q22
CD3, ζ-chain, deficiency (1) 1q23-q25
[CD4(+) lymphocyte deficiency] (2) 12pter-p12 
CD59 deficiency (3) 11p13 
Central core disease, 117000 (3) 19q13.1 
?Central core disease, one form (3) 14q12 
Centrocytic lymphoma (2) 11q13 
Cerebellar ataxia, paroxysmal acetazolamide-responsive (2) 19p 
Cerebellar ataxia with retinal degeneration (2) 3p21.1-p12 
Cerebral amyloid angiopathy (3) 20p11 
Cerebral arteriopathy with subcortical infarcts and leucoencephalopathy (2) 19q12 
Cerebrotendinous xanthomatosis (3) 2q33-qter 
Cerebrovascular disease, occlusive (3) 14q32.1
Ceroid lipofuscinosis, neuronal-1, infantile (2) 1p32 
Ceroid-lipofuscinosis, neuronal, variant late infantile form (2) 13q21.1-q32 
Cervical carcinoma (2) 11q13 
[CETP deficiency] (3) 16q21 
Charcot-Marie-Tooth disease, type II (2) 1p36-p35 
Charcot-Marie-Tooth disease, type IVA (2) 8q13-q21.1 
Charcot-Marie-Tooth neuropathy, slow nerve conduction type Ia (3) 17p11.2 
Charcot-Marie-Tooth neuropathy, slow nerve conduction type Ib, 118200 (3) 1q22 
Charcot—Marie-Tooth neuropathy, X-linked-1, dominant, 302800 (3) Xq13.1 
Charcot-Marie-Tooth neuropathy, X-linked-2, recessive (2) Xp22.2 
Chloride diarrhoea, congenital (2) 7q31 
Cholesteryl ester storage disease (3) 10q24-q25 
?Chondrodysplasia punctata, rhizomelic (2) 4p16-p14 
Chondrodysplasia punctata, X-linked dominant (2) Xq28 
Chondrodysplasia punctata, X-linked recessive (3) Xp22.3 
Choroideraemia (3) Xq21.2 
Chronic granulomatous disease, autosomal, due to deficiency of CYBA (3) 16q24 
Chronic granulomatous disease due to deficiency of NCF-1 (3) 7q11.23 
Chronic granulomatous disease due to deficiency of NCF-2 (1) 1q25 
Chronic granulomatous disease, X-linked (3) Xp21.1 
{Chronic infections, due to opsonin defect} (3) 10q11.2-q21 
Citrullinaemia (3) 9q34 


Cleft palate, X-linked (2) 

Cleidocranial dysplasia (2) 

CMO II deficiency (3)

Cockayne syndrome-2, late onset, 216410 (2) 
Coffin—Lowry syndrome (2) 

Cohen syndrome (2) 

?Colon cancer (1) 

Colon cancer, familial non-polyposis, type 1 (3) 
Colour-blindness, blue monochromatic (3) 
Colour-blindness, deutan (3) 
Colour-blindness, protan (3) 
Colour-blindness, tritan (3) 

Colorectal adenoma (1) 

Colorectal cancer (1) 

Colorectal cancer, 114500 (3) 

Colorectal cancer (3) 

Colorectal cancer (3) 


Xq21.1-q21.31 
6q21 

8q21 

10q11 
Xp22.2-p22.1 
8q22-q23 
7q22-q31.1 
2p16-p15 
Xq28 

Xq28 

Xq28 
7q31.3-q32 
12p12.1 
12p12.1 

Wp oye hil 
18q23.3 

5q21 
Colorectal cancer (3) 5q21-q22 
Colorectal cancer, familial non-polyposis type 2 (3) 3p21.3 
Colton blood group (3) 7p14 
Combined C6/C7 deficiency (1) 5p13 
Combined immunodeficiency, X-linked, moderate, 312863 (3) Xq13
?Combined variable hypogammaglobulinaemia (1) 14q32.33 


Cone-rod retinal dystrophy (2) 

Congenital bilateral absence of vas deferens (3) 
?Conotruncal cardiac anomalies (2) 

Contractural arachnodactyly, congenital (3) 
Coproporphyria (3) 

Cornea plana congenita, recessive (2) 

Corneal dystrophy, combined granular/lattice type (2) 
Corneal dystrophy, Groenouw type I (2) 

Corneal dystrophy, lattice type I, 122200 (2) 

?Cornelia de Lange syndrome (2) 

{Coronary artery disease, susceptibility to} (1) 

Cortisol resistance (3) 

CR1 deficiency (1)

?Craniofrontonasal dysplasia (2) 

Craniosynostosis, type 1 (2) 

Craniosynostosis, type 2 (3) 

[Creatine kinase, brain type, ectopic expression of] (2) 
Creutzfeldt-Jakob disease, 123400 (3) 

Crigler—Najjar syndrome, type I, 218800 (3) 

Crouzon craniofacial dysostosis, 123500 (3) 
?Cryptorchidism (2) 

?Cutis laxa, marfanoid neonatal type (1) 
[Cystathioninuria] (1) 

Cystic fibrosis (3) 

Cystinuria, 220100 (3) 

Darier disease (keratosis follicularis) (2) 

Deafness 4, congenital sensorineural (2) 

Deafness, autosomal non-syndromic sensorineural, 2 (2) 
Deafness, conductive, with stapes fixation, 304400 (3) 
Deafness, low-tone (2) 

Deafness, neurosensory, AR, 1 (2) 

Deafness, non-syndromic, recessive, 2 (2) 

Deafness-3, neurosensory non-syndromic recessive (2) 
Debrisoquine sensitivity (3) 

Dejerine—Sottas disease, myelin P(0)-related, 145900 (3) 
Dejerine-Sottas disease, PMP22 related 145900 (3) 
?Dent disease, 310468 (2) 
Dentatorubro-pallidoluysian atrophy (3) 
Dentinogenesis imperfecta-1 (2) 

Denys-Drash syndrome (3) 

Diabetes insipidus, nephrogenic (3) 

Diabetes insipidus nephrogenic, autosomal recessive (3) 
Diabetes insipidus, neurohypophyseal, 125700 (3) 
Diabetes mellitus, insulin-dependent, 3 (2) 

Diabetes mellitus, insulin-dependent, 4 (2) 

?Diabetes mellitus, insulin-dependent, 7 (2) 

?Diabetes mellitus, insulin-dependent, neonatal (2) 
?Diabetes mellitus, insulin-dependent-1 (2) 

Diabetes mellitus, insulin-resistant, with acanthosis nigricans (3) 
Diabetes mellitus, rare form (1) 

Diastrophic dysplasia (3) 


19q13.1-q13.2 
7q31.2 
22q11 
5q23-q31 
3q12 

12q21 
5q22-q33.3 
5q22-q33.3 
5q22-q33.3 
3q26.3 
6q27 

5q31 

1q32 
Xpter-p22.2 
7p21.3-p21.2 
5q34-q35 
14q32 
20pter-p12 
Chr.2 
10q26 
Xp21 
7q31.1-q31.3 
Chr.16 
7q31.2 
2p21 
12q23-q24.1 
Xp21.2 
1p32 
Xq21.1 
5q31-q33 
13q12 
11q13.5 
17p12-q12 
22q13.1 
1q22 
17p11.2 
Xp11.22 
12pter-p12 
4q13-q21 
11p13 
Xq28 
12q13 
20p13 
15q26 
11q13 
2q31 

Chr.6 
6p21.3 
19p13.2 
11p15.5 
5q31-q34 
?Dicarboxylicaminoaciduria, 222730 (1) 9p24 
DiGeorge syndrome (2) 22q11 
Diphenylhydantoin toxicity (1) 1p11-qter 
{Diphtheria, susceptibility to} (1) 5q23 
Disinhibition-dementia-Parkinsonism-amyotrophy complex (2) 17q21-q22 
Distal arthrogryposis-1 (2) 9p21-q21 
DNA ligase I deficiency (3) 19q13.2-q13.3 
Dopamine-β-hydroxylase deficiency (1) 9q34
Down syndrome (1) 21q22.3 
?Dubin-Johnson syndrome (2) 13q34 
Duchenne muscular dystrophy (3) Xp21.2 
[Dysalbuminaemic hyperthyroxinaemia] (3) 4q11-q13 
[Dysalbuminaemic hyperzincaemia] (3) 4q11-q13 
Dysautonomia, familial (2) 9q31-q33 
Dyserythropoietic anaemia, congenital, type III (2) 15q21 
Dysfibrinogenaemia, α types (3) 4q28
Dysfibrinogenaemia, β types (3) 4q28
Dysfibrinogenaemia, γ types (3) 4q28
Dyskeratosis congenita (2) Xq28 
Dyslexia, specific, 2 (2) 6p21.3 
Dysplasminogenaemic thrombophilia (1) 6q26 
Dysprothrombinaemia (3) 11p11-q12 


Dystonia, DOPA-responsive, 128230 (3) 
[Dystransthyretinaemic hyperthyroxinaemia] (3) 

?EEC syndrome (2) 

Ehlers—Danlos syndrome, type III (3) 

Ehlers—Danlos syndrome, type IV, 130050 (3) 
Ehlers—Danlos syndrome, type unspecified (3) 
Ehlers—Danlos syndrome, type VI, 225400 (3) 
Ehlers—Danlos syndrome, type VIIA1, 130060 (3) 
Ehlers—Danlos syndrome, type VIIA2, 130060 (3) 
?Ehlers—Danlos syndrome, type X (1) 

[Elliptocytosis, Malaysian-Melanesian type] (3) 
Elliptocytosis-1 (3) 

Elliptocytosis-2 (3) 

Elliptocytosis-3 (3) 

Emery—Dreifuss muscular dystrophy (3) 

Emphysema (3) 

Emphysema due to α-2-macroglobulin deficiency (1)
Emphysema-cirrhosis (3) 

Endocardial fibroelastosis-2 (2) 

Endometrial carcinoma (3) 

Enolase deficiency (1) 

?Eosinophilic myeloproliferative disorder (2) 
Epidermolysis bullosa dystrophica, dominant, 131750 (3) 
Epidermolysis bullosa dystrophica, recessive, 226600 (3) 
Epidermolysis bullosa, Herlitz junctional type, 226700 (3) 
Epidermolysis bullosa, Ogna type (2) 

Epidermolysis bullosa simplex, Dowling—Meara type, 131670 (3) 
Epidermolysis bullosa simplex, Dowling—Meara type, 131760 (3) 
Epidermolysis bullosa simplex, Koebner type, 131900 (3) 
Epidermolysis bullosa simplex, Koebner type, 131900 (3) 
Epidermolysis bullosa simplex, Weber—Cockayne type, 131800 (3) 
Epidermolysis bullosa, Weber—Cockayne type, 131800 (3) 
Epidermolytic hyperkeratosis, 113800 (3) 

Epidermolytic hyperkeratosis, 113800 (3) 

Epidermolytic palmoplantar keratoderma (3) 


14q22.1-q22.2 
18q11.2-q12.1 
7q11.2-q21.3 
2q31 

2q31 
9q34.2-q34.3 
1p36.3-p36.2 
17q21.31-q22.05 
7q22.1 

2q34 
17q21-q22 
1p36.2-p34 
1q21 
14q22-q23.2 
Xq28 

14q32.1 
12p13.3-p12.3 
14q32.1 

Xq28 

16q22.1 
1pter-p36.13
12p13 

3p21.3 
3p21.3 
1q25-q31 
8q24 
17q12-q21 
12q11-q13 
12q11-q13 
17q12-q21 
17q12-q21 
12q11-q13 
12q11-q13 
17q21-q22 
17q12-q21 


Epilepsy, benign neonatal, type 2 (2) 

Epilepsy, benign neonatal, type I, 121200 (3) 
Epilepsy, juvenile myoclonic (2) 

Epilepsy, partial (2) 

Epilepsy, progressive myoclonus (2) 

Epilepsy, progressive, with mental retardation (2) 
Epiphyseal dysplasia, multiple 1 (2) 

Epiphyseal dysplasia, multiple 2 (2) 

Episodic ataxia/myokymia syndrome, 160120 (3) 
Epithelioma, self-healing, squamous 1, Ferguson—Smith type (2) 
?Erythraemia (1) 

Erythraemias, α- (3)

Erythraemias, β- (3)

Erythroblastosis fetalis (1) 

[Erythrocytosis, familial], 133100 (3) 
Erythrokeratodermia variabilis (2) 

[Euthyroidal hyper- and hypothyroxinaemia] (1) 
Ewing sarcoma (3) 

Exertional myoglobinuria due to deficiency of LDH-A (3) 
Exostoses, multiple, type 1 (2) 

?Exostoses, multiple, type 2 (2) 

Exostoses, multiple, type 3 (2) 

Exudative vitreoretinopathy, X-linked, 305390 (3) 
Fabry disease (3) 

Facioscapulohumeral muscular dystrophy 1A (2) 
Factor H deficiency (1) 

Factor V deficiency (1) 

Factor VII deficiency (3) 

Factor X deficiency (3) 

Factor XI deficiency (3) 

Factor XII deficiency (3) 

Factor XIIIA deficiency (3) 

Factor XIIIB deficiency (3) 

Familial expansile osteolysis (2) 

Familial Mediterranean fever (2) 

?Fanconi anaemia (1) 

Fanconi anaemia-1 (2) 

Favism (3) 

{?Fetal alcohol syndrome} (1) 

?Fetal hydantoin syndrome (1) 

?Fibrodysplasia ossificans progressiva (1) 
Fibromuscular dysplasia of arteries, 135580 (3) 
Fibrosis of the extraocular muscles, congenital (2) 
Fish-eye disease (3) 

[Fish-odour syndrome] (1) 

Fletcher factor deficiency (1) 

{Fluorouracil toxicity, sensitivity to} (1) 

Focal dermal hypoplasia (2)

Fragile X syndrome (3) 

Friedreich ataxia (2) 

Fructose intolerance (3) 

Fucosidosis (3) 

Fucosyltransferase-6 deficiency (3) 

Fukuyama type congenital muscular dystrophy (2) 
Fumarase deficiency (3) 

Fundus flavimaculatus with macular dystrophy (2) 


8q 
20q13.2-q13.3 
6p21.3 

10q 

21q22.3 
8pter-p22 
19q12 

1p32 

12p13 

9q31 

7q21 
16pter-p13.3 
11p15.5 
1p36.2-p34 
19p13.3-p13.2 
1p36.2-p34 
Xq22 

22q12 
11p15.4 
8q24.11-q24.13 
11p11-q11
19p 

Xp11.4

Xq22 

4q35 

1q32 

1q23 

13q34 
13q34 

4q35 
5q33-qter 
6p25-p24 
1q31-q32.1 
18q21.1-q22 
16p13 

1q42 
20q13.2-q13.3 
Xq28 
12q24.2 
1p11-qter 
20p12 

2q31 
12q13.2-q24.1 
16q22.1 

1q

4q35 
1p22-q21 
Xp22.31 
Xq27.3 
9q13-q21.1 
9q22 

1p34 
19p13.3 
9q31-q33 
1q42.1 
1p21-p13 
G6PD deficiency (3) Xq28 
Galactokinase deficiency (1) 17q21-q22 
Galactose epimerase deficiency (1) 1p36-p35 
Galactosaemia (3) 9p13 
Galactosialidosis (3) 20q13.1 
[γ-glutamyltransferase, familial high serum] (2) 22q11.12
Gardner syndrome (3) 5q21-q22 
Gaucher disease (3) 1q21 
Gaucher disease, variant form (3) 10q21-q22 
Generalized atrophic benign epidermolysis bullosa, 226650 (1) 10q24.3 
Gerstmann-Straussler disease, 137440 (3) 20pter-p12 
Gigantism due to GHRF hypersecretion (1) 20q11.2 
?Gilbert syndrome, 143500 (1) Chr.2 
Glanzmann thrombasthenia, type A (3) 17q21.32 
Glanzmann thrombasthenia, type B (3) 17q21.32
Glaucoma, primary open angle, juvenile-onset (2) 1q21-q31 
Glioblastoma multiforme (2) 10p12-q23.2 
Glucocorticoid deficiency, due to ACTH unresponsiveness (1) 18p11.2 
Glucose/galactose malabsorption (3) 22q11.2-qter 
Glutaricacidaemia, type I (3) 19p13.2 
Glutaricacidaemia, type IIC (3) 4q32-qter 
Glutaricaciduria, type IIA (1) 15q23-q25 
Glutaricaciduria, type IIB (3) 19q13.3 
Glutathioninuria (1) 22q11.1-q11.2 
Glycerol kinase deficiency (2) Xp21.3-q21.2 
Glycogen storage disease III (1) 1p21 
Glycogen storage disease IV (1) 3p12 
Glycogen storage disease, type I (3) 17q21 
Glycogen storage disease, type II (3) 17q23 
Glycogen storage disease VI (1) 14q21-q22 
Glycogen storage disease VII (3) 1cen-q32
Glycogen storage disease, X-linked hepatic (2) Xp22.2-p22.1 
?Glycoprotein Ia deficiency (2) 5q23-q31 
[Glyoxalase II deficiency] (1) 16p13 
GM1-gangliosidosis (3) 3p21.33 
GM2-gangliosidosis, AB variant (3) 5q31.3-q33.1 
GM2-gangliosidosis, juvenile, adult (3) 15q23-q24 
Goeminne TKCR syndrome (2) Xq28 

Goitre, adolescent multinodular (1) 8q24.2-q24.3
Goitre, congenital (3) 2p13 

Goitre, non-endemic, simple (3) 8q24.2-q24.3 
?Goldenhar syndrome (2) 7p 

Gonadal dysgenesis, XY female type (2) Xp22.11-p21.2 
Gonadal dysgenesis, XY type (3) Yp11.3 
?Gonadotropin deficiency (2) Xp21 

Graves disease, 275000 (1) 14q31 

Greig cephalopolysyndactyly syndrome, 175700 (3) 7p13 
?Growth hormone deficient dwarfism (1) 7p15-p14 
Gustavson syndrome (2) Xq26 
?Gynaecomastia, familial, due to increased aromatase activity (1) 15q21.1 
Gyrate atrophy of choroid and retina with ornithinemia, B6 responsive or unresponsive (3) 10q26 
Haemochromatosis (2) 6p21.3 
Haemodialysis-related amyloidosis (1) 15q21-q22 
Haemolytic anaemia due to ADA excess (1) 20q13.11 
Haemolytic anaemia due to adenylate kinase deficiency (1) 9q34.1 
Haemolytic anaemia due to bisphosphoglycerate mutase deficiency (1) 7q31-q34 
Haemolytic anaemia due to G6PD deficiency (3) Xq28 


Haemolytic anaemia due to glucosephosphate isomerase deficiency (3) 19q13.1 
Haemolytic anaemia due to glutathione peroxidase deficiency (1) 3q11-q12
Haemolytic anaemia due to glutathione reductase deficiency (1) 8p21.1 
Haemolytic anaemia due to hexokinase deficiency (1) 10q22 
Haemolytic anaemia due to PGK deficiency (3) Xq13
Haemolytic anaemia due to phosphofructokinase deficiency (1) 21q22.3 
Haemolytic anaemia due to triosephosphate isomerase deficiency (3) 12p13 
Haemophilia A (3) Xq28 
Haemophilia B (3) Xq27.1-q27.2 
Haemorrhagic diathesis due to 'antithrombin' Pittsburgh (3) 14q32.1
Haemorrhagic diathesis due to PAI1 deficiency (1) 7q21.3-q22 
Harderoporphyrinuria (3) 3q12 


Heart block, progressive familial, type I (2) 

Heinz body anaemias, α- (3)

Heinz body anaemias, β- (3)

Hepatic lipase deficiency (3) 

Hepatocellular carcinoma (1) 

?Hepatocellular carcinoma (1) 

Hepatocellular carcinoma (3) 

Hereditary haemorrhagic telangiectasia, 187300 (3) 
[Hereditary persistence of α-fetoprotein] (3)
?Hereditary persistence of fetal hemoglobin (3) 
?Hereditary persistence of fetal hemoglobin, heterocellular, Indian type (2) 
?Hermansky—Pudlak syndrome, 203300 (1) 
?Hermansky—Pudlak syndrome, 203300 (1) 
Heterocellular hereditary persistence of fetal hemoglobin, Swiss type (2) 
Heterotaxy, X-linked visceral (2) 

[Hex A pseudodeficiency] (1)

?HHH syndrome (2) 

Hirschsprung disease, 142623 (3) 

Hirschsprung disease-2, 600155 (3) 
[Histidinaemia] (1) 

HMG-CoA lyase deficiency (3) 
Holoprosencephaly, type 3 (2) 
?Holoprosencephaly-1 (2) 

?Holoprosencephaly-2 (2) 

?Holoprosencephaly-4 (2) 

Holt-Oram syndrome (2) 

Homocystinuria, B6-responsive and non-responsive types (3) 
Homocystinuria due to MTHFR deficiency (3)
[?Homosexuality, male] (2) 

HPFH, deletion type (3) 

HPFH, non-deletion type A (3) 

HPFH, non-deletion type G (3) 

HPRT-related gout (3) 

?Humoral hypercalcaemia of malignancy (1) 
Huntington disease (3) 

Hydrocephalus due to aqueductal stenosis, 307000 (3) 
Hydrops fetalis, one form (1) 

3-hydroxyacyl-CoA dehydrogenase deficiency (1) 
Hyperbetalipoproteinaemia (3) 

Hypercalcaemia, hypocalciuric, familial (3) 
Hypercholesterolaemia, familial (3) 
Hyperchylomicronaemia syndrome, familial (3) 
Hyperglycinaemia, isolated non-ketotic, type I (3) 
Hyperglycinaemia, non-ketotic, type II (1)
?Hyperimmunoglobulin G1 syndrome (2) 


19q13.2-q13.3 
16pter-p13.3 
11p15.5 
15q21-q23 
11p14-p13 
2q14-q21 
4q32.1 
9q34.1 
4q11-q13 
11p15.5 

7q36 
12q12-q13 
15q15 
Xp22.2 
Xq25-q26 
15q23-q24 
13q34 
10q11.2 
13q22 
12q22-q23 
1pter-p33
7q36 
18pter-q11
2p21 
14q11.1-q13 
12q21.3-q22 
21q22.3 
1p36.3 

Xq28 
11p15.5 
11p15.5 
11p15.5 
Xq26-q27.2 
12p12.1-p11.2 
4p16.3 

Xq28 
19q13.1 
Chr.7 

2p24 
3q21-q24 
19p13.2-p13.1 
8p22 

9p22 
3p21.2-p21.1 
14q32.33
Hyperkalaemic periodic paralysis (3)
?Hyperleucinaemia-isoleucinaemia or hypervalinaemia (1)
Hyperlipoproteinaemia I (1)

Hyperlipoproteinaemia, type Ib (3) 
Hyperlipoproteinaemia, type III (3) 

Hyperoxaluria, primary, type 1 (3) 


Hyperphenylalaninaemia due to pterin-4a-carbinolamine dehydratase deficiency, 264070 (3) 


[Hyperphenylalaninaemia, mild] (3) 

[?Hyperproglucagonaemia] (1)
Hyperproinsulinaemia, familial (3) 
[Hyperproreninaemia] (3) 

{?Hypertension, essential} (1)

?Hypertension, essential, 145500 (1) 
{Hypertension, essential, susceptibility to} (3)
Hyperthyroidism, congenital (3)
Hypertriglyceridaemia (3) 
Hypertriglyceridaemia, one form (3) 
?Hypervalinaemia or hyperleucine-isoleucinaemia (1) 
Hypoalphalipoproteinaemia (3) 
Hypobetalipoproteinaemia (3) 
Hypocalcaemia, autosomal dominant (3) 
Hypocalciuric hypercalcaemia, type II (2) 
[Hypoceruloplasminaemia, hereditary] (1) 
Hypochondroplasia, 146000 (3) 
Hypofibrinogenaemia, y types (3) 
?Hypoglycaemia due to PCK1 deficiency (1) 
Hypogonadism, hypergonadotropic (3) 
?Hypogonadotropic hypogonadism due to GNRH deficiency, 227200 (1) 
Hypokalaemic periodic paralysis, 170400 (3) 
Hypomagnesaemia, X-linked primary (2) 
?Hypomelanosis of Ito (2) 

?Hypomelanosis of Ito (2) 
Hypoparathyroidism, autosomal dominant (3) 
Hypoparathyroidism, autosomal recessive (3) 
Hypoparathyroidism, familial (2) 
Hypoparathyroidism, X-linked (2) 
?Hypophosphatasia, adult, 146300 (1) 
Hypophosphatasia, infantile, 241500 (3) 
Hypophosphataemia, hereditary (2) 
?Hypophosphataemia with deafness (2) 
Hypoprothrombinaemia (3) 
?Hypospadias-dysphagia syndrome (2) 
Hypothyroidism, congenital (3) 
Hypothyroidism, hereditary congenital (3) 
Hypothyroidism, non-goitrous (3) 
Hypothyroidism, non-goitrous, due to TSH resistance (3) 
Ichthyosis bullosa of Siemens, 146800 (3) 
?Ichthyosis vulgaris, 146700 (1) 

Ichthyosis, X-linked (3) 

[Ii blood group, 110800] (1)

?Immotile cilia syndrome (2) 
Immunodeficiency, X-linked, with hyper-IgM (3) 
Incontinentia pigmenti, familial (2) 
Incontinentia pigmenti, sporadic type (2) 
[Inosine triphosphatase deficiency] (1) 
Insomnia, fatal familial (3) 

Insulin-dependent diabetes mellitus-2 (2) 


17q23.1-q25.3 
12pter-q12 
8p22 
19q13.2 
19q13.2 
2q36-q37 
10q22 
12q24.1 
2q36-q37 
11p15.5 
1q32 
16p13.11 
17q21-q22 
1q42-q43 
14q31 
11q23 
11q23 
Chr.19 
11q23 

2p24 
3q21-q24 
19p13.3 
3q21-q24 
4p16.3 
4q28 
20q13.31 
19q13.32 
8p21-p11.2 
1q32 
Xp22.2 
15q11-q13 
9q33-qter 
11p15.3-p15.1 
11p15.3-p15.1 
3q13 
Xq26-q27 
1p36.1-p34 
1p36.1-p34 
Xp22.2-p22.1 
Xp22 
11p11-q12
5p13-p12 
2p13 
8q24.2-q24.3 
1p13 

14q31 
12q11-q13 
1q21 
Xp22.32 
9q21 

6p 

Xq26 

Xq28 
Xp11.21 
20p 
20pter-p12 
11p15.5 
Disorder Location
Interferon, α, deficiency (1) 9p21
Interferon, immune, deficiency (1) 8 24.1 
?Isolated growth hormone deficiency due to defect in GHRF (1) 20g 2 
Isolated growth hormone deficiency, Illig type with absent GH and Kowarski type with bioinactive GH (3) 17q22-q24

Isovalericacidaemia (3) 15q14-q15
Jackson—Weiss syndrome, 123150 (3) eae es 
?Jacobsen syndrome (2) 1 ni 
Juberg—Marsidi syndrome (2) X 12. 21 
Junctional epidermolysis bullosa inversa (2) 1 1 ; 
Kallmann syndrome (3) Xp22.3
Kanzaki disease (3) 22q11
[κ light chain deficiency] (3) 2p12
Keratoderma, palmoplantar, non-epidermolytic (3) 12q11-q13 
Keratosis follicularis spinulosa decalvans (2) Xp22.2-p21.2 
3-ketothiolase deficiency (3) 11q22.3-q23.1
[Kininogen deficiency] (3) 3q26-qter
?Klippel—Feil syndrome (2) 5q11.2 
Kniest dysplasia (3) 12q13.11-q13.2 
Kostmann neutropenia, 202700 (3) 1p35-p34.3 
Krabbe disease (3) 14q24.3-q32.1 
?Lactase deficiency, adult, 223100 (1) 2q21 
?Lactase deficiency, congenital (1) 2q21 
Lactic acidosis due to defect in iron-sulphur cluster of complex I (1) 2q33-q34 
?Lactoferrin-deficient neutrophils, 245480 (1) 3q21-q23 
Lamellar ichthyosis, autosomal recessive (2) 14q11.2 
Langer—Giedion syndrome (2) 8q24.11-q24.13 
Laron dwarfism (3) 5p13-p12 
?Laryngeal adductor paralysis (2) 6p21.3-p21.2 
{Lead poisoning, susceptibility to} (3) 9q34 
Leucocyte adhesion deficiency (1) 21q22.3 
?Leiomyomata, multiple hereditary cutaneous (2) 18p11.32 
Leiomyomatosis, diffuse (1) Xq22 
Leiomyomatosis-nephropathy syndrome, 308940 (1) Xq22 
Leprechaunism (3) 19p13.2 
Lesch—Nyhan syndrome (3) Xq26-q27.2 
?Letterer—Siwe disease (2) 13q14-q31 
Leukaemia, acute lymphoblastic (1) 19p13.3 
Leukaemia, acute lymphoblastic (2) 9p22-p21 
?Leukaemia, acute lymphocytic, with 4/11 translocation (3) 4q21 
Leukaemia, acute myeloid (2) 9q34.1 
Leukaemia, acute myeloid (3) 21q22.3 
Leukaemia, acute myeloid, M2 type (1) Xp22.32 
Leukaemia, acute non-lymphocytic (2) 6p23 
Leukaemia, acute pre-B-cell (2) 1q23 
Leukaemia, acute promyelocytic (1) 17q12 
Leukaemia, acute promyelocytic (2) 15q22 
Leukaemia, acute T-cell (2) 11p13 
Leukaemia, acute, T-cell (2) 11p13 
Leukaemia, chronic lymphocytic, B-cell (2) 13q14 
Leukaemia, chronic myeloid (3) 22q11.21 
Leukaemia, chronic myeloid (3) 9q34.1 
Leukaemia, myeloid/lymphoid or mixed-lineage (2) 11q23
Leukaemia, T-cell acute lymphoblastic (2) 11p15 

9q34.3 


Leukaemia, T-cell acute lymphoblastic (2) 
Leukaemia, T-cell acute lymphoblastoid (2) 
Leukaemia, T-cell acute lymphocytic (2) 


19p13.2-p13.1 
10q24 
?Leukaemia, transient (2) 21q11.2 
Leukaemia-1, T-cell acute lymphoblastic (3) 1p32 
Leukaemia-2, T-cell acute lymphoblastic (3) 9q31 
Leukaemia/lymphoma, B-cell, 1 (2) 11q13.3 
Leukaemia/lymphoma, B-cell, 2 (2) 18q21.3 
Leukaemia/lymphoma, B-cell, 3 (2) 19q13 
Leukaemia/lymphoma, T-cell (2) 14q32.1
Leukaemia/lymphoma, T-cell (2) 2q34 
Leukaemia/lymphoma, T-cell (3) 14q11.2 
Liddle syndrome (3) 16p12.3 
Li-Fraumeni syndrome (3) 17p13.1 
Lipoamide dehydrogenase deficiency (3) 7q31-q32 
Lipoma, benign (2) 12q15 
Lipoprotein lipase deficiency (3) 8p22 
Liposarcoma (1) 19p13.2-q13.3 
Long QT syndrome-1 (2) 11p15.5 
Long QT syndrome-2 (3) 7q35-q36 
Long QT syndrome-3 (3) 3p24-p21 
Lowe syndrome (3) Xq26.1 
{Lupus erythematosus, susceptibility to} (2) 12pter-p12 
Lupus erythematosus, systemic, 152700 (1) 1q23 
Lymphoma, B-cell (2) 3q27 
Lymphoma, diffuse large cell (3) 3q27 
Lymphoma/leukaemia, B-cell, variant (1) 18q21.3
Lymphoproliferative syndrome, X-linked (2) Xq25 
?Lynch cancer family syndrome II (2) 18q11-q12 
?Lysosomal acid phosphatase deficiency (1) 11p12-p11 
Machado-Joseph disease (2) 14q24.3-q31 
Macrocytic anaemia of 5q- syndrome, refractory (2) 5q12-q32 
Macrocytic anaemia refractory, of 5q- syndrome, 153550 (3) 5q31.1
[Macrothrombocytopenia] (1) 7q11.2 
Macular dystrophy (3) 6p21.1-cen 
Macular dystrophy, atypical vitelliform (2) 8q24 
Macular dystrophy, dominant cystoid (2) 7p21-p15 
Macular dystrophy, North Carolina type (2) 6q14-q16.2 
Macular dystrophy, vitelliform type (2) 11q13 
Male germ cell tumour (2) 12q22 
?Male infertility due to acrosin deficiency (2) 22q13-qter 
?Male infertility, familial (1) 11p13 
?Male infertility, familial (1) 11p13 
?Male pseudohermaphroditism due to defective LH (1) 19q13.32 
Malignant hyperthermia susceptibility 2 (2) 17q11.2-q24 
Malignant hyperthermia susceptibility-1, 145600 (3) 19q13.1 
Malignant hyperthermia susceptibility-3, 154276 (3) 7q21-q22 
Malignant melanoma, cutaneous (2) 1p36 
?Manic-depressive illness, X-linked (2) Xq28 
Mannosidosis (1) 19cen-q12
Maple syrup urine disease, type 3 (3) 6p22-p21 
Maple syrup urine disease, type Ia (3) 19p13.1-q13.2 
Maple syrup urine disease, type II (3) 1p31 
Marfan syndrome, 154700 (3) 15q21.1 
Maroteaux—Lamy syndrome, several forms (3) 5q11-q13 
MASA syndrome, 303350 (3) Xq28 

Mast cell leukaemia (3) 4q12 
Maturity-onset diabetes of the young, type III (2) 12q22-qter 
McArdle disease (3) 11q13 
McCune-Albright polyostotic fibrous dysplasia, 174800 (3) 20q13.2 
McLeod phenotype (3) Xp21.2-p21.1 
Medullary thyroid carcinoma, 155240 (3) 10q11.2 
Megalocornea, X-linked (2) Xq21.3-q22 
?Melanoma (1) 2p25.3 
Melanoma (1) 9p21 
Melanoma, cutaneous malignant (2) 9p21 
?Melkersson-Rosenthal syndrome (2) 9p11
Membranoproliferative glomerulonephritis (1) 1q32
Meningioma, NF2-related (3) 22q12.2 
Meningioma, SIS-related (3) 22q12.3-q13.1 
Menkes’ disease (2) Xq12-q13
Mental retardation, Snyder—Robinson type (2) Xp21 
Mental retardation, X-linked, FRAXE type (3) Xq28 
Mental retardation, X-linked, FRAXF type (3) Xq28 
?Mental retardation, X-linked non-specific, with aphasia (2) Xp11
Mental retardation, X-linked, syndromic-1, with dystonic movements, ataxia, and seizures (2) Xp22.2-p22.1 
Mental retardation, X-linked, syndromic-2, with dysmorphism and cerebral atrophy (2) Xp11-q21 
Mental retardation, X-linked, syndromic-3, with spastic diplegia (2) Xp11-q21.3 
Mental retardation, X-linked, syndromic-4, with congenital contractures and low fingertip arches (2) Xq13-q22
Mental retardation, X-linked, syndromic-5, with Dandy-Walker malformation, basal ganglia disease, and seizures (2) Xq25-q27
Mental retardation, X-linked, syndromic-6, with gynaecomastia and obesity (2) Xp21.1-q22 
Mental retardation, X-linked-1, non-dysmorphic (2) Xp22 
?Mental retardation, X-linked-2, non-dysmorphic (2) Xq11-q12
Mental retardation, X-linked-3 (2) Xq28 
Mental retardation-skeletal dysplasia (2) Xq28 


Mephenytoin poor metabolizer (3) 

Metachromatic leucodystrophy (3) 

Metachromatic leucodystrophy due to deficiency of SAP-1 (3) 
Metaphyseal chondrodysplasia, Murk Jansen type, 156400 (3) 
Metaphyseal chondrodysplasia, Schmid type (3) 
Methaemoglobinaemia due to cytochrome b5 deficiency (3) 
Methaemoglobinaemia, enzymopathic (3) 
Methaemoglobinaemias, α- (3)

Methaemoglobinaemias, β- (3)

Methylmalonicaciduria, mutase deficiency type (3) 
Mevalonicaciduria (3) 

?Microphthalmia with linear skin defects (2) 

Migraine, hemiplegic-1 (2) 

Miller—Dieker lissencephaly syndrome (2) 

?Mitochondrial complex I deficiency, 252010 (1) 

MODY, one form (3) 

MODY, type I (2) 

MODY, type II, 125851 (3) 

?Moebius syndrome (2) 

?Monocyte carboxyesterase deficiency (1) 

Mucolipidosis II (1) 

Mucolipidosis II (1) 

Mucopolysaccharidosis Ih (3) 

Mucopolysaccharidosis Ih/s (3) 

Mucopolysaccharidosis II (3) 

Mucopolysaccharidosis Is (3) 

Mucopolysaccharidosis IVA (3) 

Mucopolysaccharidosis IVB (3) 

Mucopolysaccharidosis VII (3) 

Multiple carboxylase deficiency, biotin-responsive (3) 


10q24.1-q24.3 
22q13.31-qter 
10q21-q22 
3p22-p21.1 
6q21-q22.3 
18q23 
22q13.31-qter 
16pter-p13.3 
11p15.5 

6p21 

Chr.12 
Xp22.2 

19p13 
17p13.3 
11q13 
11p15.5 
20q13 
7p15-p13 
13q12.2-q13 
16q13-q22.1 
4q21-q23 
4q21-q23 
4p16.3 
4p16.3 

Xq28 

4p16.3 
16q24.3 
3p21.33 
7q21.11 
21q22.1 
Multiple endocrine neoplasia I (1) 11q13
Multiple endocrine neoplasia IIA, 171400 (3) 10q11.2 
Multiple endocrine neoplasia IIB, 162300 (3) 10q11.2 
?Multiple lipomatosis (2) 12q15 
{?Multiple sclerosis, susceptibility to} (2) 18q22-qter 
Muscle glycogenosis (3) Xq13 
Muscular dystrophy, congenital, merosin-negative (2) 6q22-q23 
Muscular dystrophy, Duchenne-like, autosomal (2) 13q12-q13 
Muscular dystrophy, Duchenne-like, type 2 (3) 17q12-q21.33 
Muscular dystrophy, limb-girdle, autosomal dominant (2) 5q22.3-q31.3 
Muscular dystrophy, limb-girdle, type 2A (2) 15q15.1-q21.1 
Muscular dystrophy, limb-girdle, type 2B (2) 2p16-p13 
Myelodysplasia syndrome-1 (3) 3q26 
Myelodysplastic syndrome, preleukaemic (3) 5q31.1 
Myelogenous leukaemia, acute (3) 5q31.1 
Myeloid leukaemia, acute, M4Eo subtype (2) 16q22 
Myeloperoxidase deficiency (3) 17q21.3-q22 
Myoadenylate deaminase deficiency (3) 1p21-p13 
{Myocardial infarction, susceptibility to} (3) 17q23
Myoglobinuria/haemolysis due to PGK deficiency (3) Xq13
?Myopathy, desminopathic (1) 2q35 
Myopathy, distal (2) 14q 
Myopathy due to phosphoglycerate mutase deficiency (3) 7p13-p12.3 
?Myopathy due to succinate dehydrogenase deficiency (1) 1p22.1-qter 
Myopia-1 (2) Xq28 


Myotonia congenita, atypical acetazolamide-responsive (3) 
Myotonia congenita, dominant, 160800 (3) 

Myotonia congenita, recessive, 255700 (3) 

Myotonic dystrophy (3) 

Myotubular myopathy, X-linked (2) 

Myxoid liposarcoma (3) 

?N syndrome, 310465 (1) 

Nail-patella syndrome (2) 

Nance-Horan syndrome (2) 

Nemaline myopathy-1, 161800 (3) 

Neonatal alloimmune thrombocytopenia (2) 

Neonatal hyperparathyroidism, 239200 (3) 
Nephrolithiasis 2, X-linked (2) 

Nephrolithiasis, X-linked, with renal failure (2) 
Nephronophthisis, juvenile (2) 

Nephrosis, congenital, Finnish (2) 

Neuroblastoma (2) 

Neuroepithelioma (2) 

Neurofibromatosis, type 2 (3) 

Neurofibromatosis, type I (3) 

Neuropathy, recurrent, with pressure palsies, 162500 (3) 
Neutropenia, immune (2) 

Neutropenia, neonatal alloimmune (1) 

Niemann-Pick disease, type A (3) 

Niemann-Pick disease, type B (3) 

Niemann-Pick disease, type C (2) 

Night blindness, congenital stationary, type 3, 163500 (3) 
Night blindness, congenital stationary, type I (2) 

Night blindness, congenital stationary, rhodopsin-related (3)


Noonan syndrome-1 (2) 
Norrie disease (3) 


{Non-insulin dependent diabetes mellitus, susceptibility to} (2) 


17q23.1-q25.3 
7q35 

7q35 
19q13.2-q13.3 
Xq28 
12q13.1-q13.2 
Xp22.3-p21.1 
9q34.1 
Xp22.3-p21.1 
1q22-q23 
5q23-q31 
3q21-q24 
Xp11.23-p11.22 
Xp11.22 

2q13 
19q12-q13.1 
1p36.2-p36.1 
22q12 
22q12.2 
17q11.2 
17p11.2 

1q23 

Chr.4 
11p15.4-p15.1 
11p15.4-p15.1 
18q11-q12 
4p16.3 
Xp11.3 
3q21-q24 
19q13.3 
12q22-qter 
Xp11.4 
Norum disease (3) 16q22.1 
Nucleoside phosphorylase deficiency, immunodeficiency due to (3) 14q13.1 
?Obesity (2) 7q31 
?Ocular albinism, autosomal recessive (2) 6q13-q15 


Ocular albinism, Forsius—Eriksson type (2) 
Ocular albinism, Nettleship—Falls type (3) 
Ocular albinism with sensorineural deafness (2) 
Oculopharyngeal muscular dystrophy-1 (2) 
Oestrogen resistance (3) 

Optic atrophy 1 (2) 

Optic nerve coloboma with renal anomalies, 120330 (3) 
Ornithine transcarbamylase deficiency (3) 
Orofacial cleft (2) 

Oroticaciduria (1) 

OSMED syndrome, 215150 (3) 

Osteoarthrosis, precocious (3) 


Osteogenesis imperfecta, 4 clinical forms, 166200, 166210, 259420, 166220 (3) 
Osteogenesis imperfecta, 4 clinical forms, 166200, 166210, 259420, 166220 (3) 


?Osteopetrosis, 259700 (1) 

Osteoporosis, idiopathic, 166710 (3) 
?Osteoporosis, involutional (1) 

Osteosarcoma, 259500 (2) 

Otopalatodigital syndrome, type I (2) 

Ovarian cancer, serous (2) 

Ovarian cancer, sporadic (3) 

Ovarian carcinoma, 167000 (2) 

Ovarian carcinoma (2) 

Ovarian carcinoma (3) 

Ovarian failure, premature (2) 

Pachyonychia congenita, Jackson—Lawler type (2) 
?Paget disease of bone (2) 

?Pallister—Hall syndrome (2) 

Palmoplantar keratoderma, Bothnia type (2) 
Pancreatic lipase deficiency (1) 
?Panhypopituitarism, X-linked (2) 
Paraganglioma (2) 

Paramyotonia congenita, 168300 (3) 
Paraneoplastic sensory neuropathy (1) 
Parathyroid adenomatosis 1 (2) 

?Parietal foramina (2) 

{?Parkinsonism, susceptibility to} (1) 

Paroxysmal nocturnal haemoglobinuria (3) 
Partington syndrome II (2) 
Pelizaeus—Merzbacher disease (3) 

Pelviureteric junction obstruction (2) 

?Pendred syndrome (2) 

PEO with mitochondrial DNA deletions (2) 
Perineal hypospadias (3) 

Periodontitis, juvenile (2) 

Peroxisomal bifunctional enzyme deficiency (1) 
Persistent hyperinsulinaemic hypoglycaemia of infancy (2) 
Persistent Mullerian duct syndrome (3) 

Peters anomaly (3) 

Pfeiffer syndrome, 101600 (3) 

Pfeiffer syndrome, 101600 (3) 

Phenylketonuria (3) 

Phenylketonuria, atypical, due to GCH1 deficiency, 233910 (1) 
Phenylketonuria due to dihydropteridine reductase deficiency (3) 


Xp11.4-p11.23 
Xp22.3 
Xp22.3 
14q11.2-q13 
6q25.1 
3q28-qter 
10q25 

Xp21.1 
6p24.3 

3q13 

6p21.3 
12q13.11-q13.2 
17q21.31-q22.05 
7q22.1 
1p21-p13 
17q21.31-q22.05 
12q12-q14 
13q14.1-q14.2 
Xq28 
6q26-q27 
17q21 
19q13.1-q13.2 
9p24. 

16q22.1 
Xq26-q27 
17q12-q21 
6p21.3 

3p25.3 
12q11-q13 
10q26.1 
Xq21.3-q22 
11q22.3-q23.2 
17q23.1-q25.3 
1p34 

11q13 
11p12-p11.12 
22q13.1 
Xq22.1 
Xp22-p21 
Xq22 

6p 

8q24 

10q 

Xq11-q12
4q11-q13 
3q26.3-q28 
11p15.1-p14 
19p13.3-p13.2 
11p13 

10q26 
8p12-p11.2 
12q24.1 
14q22.1-q22.2 
4p15.31 
Phenylketonuria due to PTS deficiency (3)
Phaeochromocytoma (2) 

Phosphoribosyl pyrophosphate synthetase-related gout (3) 
?Phosphorylase kinase deficiency of liver and muscle, 261750 (2) 
Piebaldism (3) 

Pituitary hormone deficiency, combined (3) 

PK deficiency haemolytic anaemia (3) 

[Placental lactogen deficiency] (1) 

Placental steroid sulphatase deficiency (3) 

Plasmin inhibitor deficiency (3) 

Plasminogen activator deficiency (1) 

Plasminogen deficiency, types I and II (1) 

Plasminogen Tochigi disease (3) 

Platelet α/δ storage pool deficiency (1)

Platelet glycoprotein IV deficiency (3) 

{Polio, susceptibility to} (2) 

Polycystic kidney disease, adult, type II (2) 

Polycystic kidney disease, autosomal recessive (2) 
Polycystic kidney disease, infantile severe, with tuberous sclerosis (3) 
Polycystic kidney disease-1 (3) 

Polyposis coli, familial (3) 

Porphyria, acute hepatic (3) 

Porphyria, acute intermittent (3) 

Porphyria, Chester type (2) 

Porphyria, congenital erythropoietic (3) 

Porphyria cutanea tarda (3) 

Porphyria, hepatoerythropoietic (3) 

Porphyria variegata (2) 

?Prader—Willi syndrome (1) 

Prader-Willi syndrome (2) 

Precocious puberty, male, 176410 (3) 

{Pre-eclampsia, susceptibility to} (3) 

Progressive cone dystrophy (2) 

Prolactinoma, hyperparathyroidism, carcinoid syndrome (2) 
Prolidase deficiency (3) 

Properdin deficiency, X-linked (3) 

Propionicacidaemia, type I or pccA type (1) 
Propionicacidaemia, type II or pccB type (3) 

Prostate cancer (3) 

Protein C cofactor deficiency (3) 

Protein C inhibitor deficiency (2) 

Protein S deficiency (3) 

Protoporphyria, erythropoietic (3) 

Protoporphyria, erythropoietic, recessive, with liver failure (3) 
Pseudoachondroplastic dysplasia (2) 
Pseudohermaphroditism, male, with gynaecomastia (3) 
Pseudohypoaldosteronism (1) 
Pseudohypoparathyroidism, type Ia, 103580 (3) 
Pseudovaginal perineoscrotal hypospadias (3) 
Pseudo-vitamin D dependency rickets 1 (2) 
Pseudo-Zellweger syndrome (1) 

Psoriasis susceptibility (2) 

Pulmonary alveolar proteinosis, congenital, 265120 (3) 
Purpura fulminans, neonatal (1) 

?Pyridoxine dependency with seizures (1) 
Pyropoikilocytosis (3) 

Pyruvate carboxylase deficiency (1) 


11q22.3-q23.3 
1p

Xq22-q24 
16q12-q13.1 
4q12

3p11

1q21 
17q22-q24 
Xp22.32 
17pter-p12 
8p12 

6q26 

6q26 
1q23-q25 
7q11.2
19q13.2-q13.3 
4q21-q23 
6p21.1-p12 
16p13.3 
16p13.31-p13.12 
5q21-q22 
9q34 
11q24.1-q24.2 
11q23.1 
10q25.2-q26.3 
1p34 

1p34 

14q32

15q12 

15q11 

2p21 
1q42-q43 
Xp11.3 

11q13 
19cen-q13.11 
Xp11.4-p11.23 
13q32 
3q21-q22 
Xq11-q12
1q23 

14q32.1
3p11.1-q11.2 
18q21.3 
18q21.3 
19q12 

9q22 

4q31.1 
20q13.2 

Chr.2 

12q14 
3p23-p22 

17q 
2p12-p11.2
2q13-q14 
2q31 

1q21 

11q
Pyruvate dehydrogenase deficiency (3) Xp22.2-p22.1
Rabson—Mendenhall syndrome (3) 19p13.2 
?Ragweed sensitivity (2) 6p21.3 
Renal cell carcinoma (2) 3p14.2 
Renal cell carcinoma (3) 3p26-p25 
?Renal cell carcinoma, papillary, 1 (2) 1q21 
Renal cell carcinoma, papillary, 2 (2) Xp11.2 
?Renal glucosuria, 253100 (1) 16p11.2 
[Renal glucosuria] (2) 6p21.3 
Renal tubular acidosis-osteopetrosis syndrome (3) 8q22 
?Resistance/susceptibility to TB, etc. (1) 2q35 
?Retinal cone dystrophy-1 (2) 6q25-q26 
Retinitis pigmentosa, autosomal recessive (3) 3q21-q24 
Retinitis pigmentosa, digenic (3) 11q13 
Retinitis pigmentosa, digenic (3) 6p21.1-cen 
Retinitis pigmentosa, peripherin-related (3) 6p21.1-cen 
Retinitis pigmentosa-1 (2) 8p11-q21
Retinitis pigmentosa-10 (2) 7q31-q35 
Retinitis pigmentosa-11 (2) 19q13.4 
Retinitis pigmentosa-12, autosomal recessive (2) 1q31-q32.1 
Retinitis pigmentosa-13 (2) 17p 
Retinitis pigmentosa-14 (2) 6p21.3 
Retinitis pigmentosa-2 (2) Xp11.3 
Retinitis pigmentosa-3 (2) Xp21.1 
Retinitis pigmentosa-4, autosomal dominant (3) 3q21-q24 
?Retinitis pigmentosa-6 (2) Xp21.3-p21.2 
Retinitis pigmentosa-9 (2) 7p15.1-p13 
Retinitis punctata albescens (3) 6p21.1-cen 
Retinoblastoma (3) 13q14.1-q14.2 
?Retinol binding protein, deficiency of (1) 10q23-q24 
Retinoschisis (2) Xp22.3-p22.1 
?Rett syndrome (2) Xp 
Rhabdomyosarcoma (2) 11p15.5 
Rhabdomyosarcoma, alveolar, 268200 (3) 13q14.1 
Rhabdomyosarcoma, alveolar, 268220 (3) 2q35 
Rh-null disease (1) 3cen-q22 
?Rh-null haemolytic anaemia (1) 1p36.2-p34 
Rickets, vitamin D-resistant (3) 12q12-q14 
Rieger syndrome (2) 4q25-q27 
Rippling muscle disease-1 (2) 1q41 

Rod monochromacy (2) Chr.14 
?Rothmund-Thomson syndrome (2) Chr.8 
Rubinstein—Taybi syndrome (2) 16p13.3 
Russell-Silver syndrome (2) 17q25 
Saethre—Chotzen syndrome (2) 7p21 
Salivary gland pleomorphic adenoma (2) 8q12 

Salla disease (2) 6q 
Sandhoff disease (3) 5q13 
?Sanfilippo disease, type IIIC (2) Chr.14 
Sanfilippo syndrome D (1) 12q14 
Sarcoma, synovial (3) Xp11.2 
Schindler disease (3) 22q11 
?Schizophrenia (2) 5q11.2-q13.3 


Schizophrenia, chronic (3) 
{?Schizophrenia, susceptibility to} (2) 
Schizophrenia-3 (2) 

Schwannoma, sporadic (3) 


21q21.3-q22.05 
3q13.3 
6pter-p22 
22q12.2 
Sclerotylosis (2) 4q28-q31
SED congenita (3) 12q13.11-q13.2 
Segawa syndrome, recessive (3) 11p15.5 
Selective T-cell defect (3) 2q12 

Severe combined immunodeficiency due to ADA deficiency (3) 20q13.11 
Severe combined immunodeficiency due to IL2 deficiency (1) 4q26-q27 
Severe combined immunodeficiency, HLA class II-negative type, 209920 (2) 19p13.1 
?Severe combined immunodeficiency, type I (1) 8q11

Severe combined immunodeficiency, X-linked, 300400 (3) Xq13 

Short stature (2) Xpter-p22.32 
?Sialidosis (2) 6p21.3 

Sickle cell anaemia (3) 11p15.5 
Simpson—Golabi-Behmel syndrome (2) Xq26 

?Situs inversus viscerum (2) 14q32 
Sjogren—Larsson syndrome (2) 17q11.2 

?SLE (1) 1q32
Small-cell cancer of lung (2) 3p23-p21 
SMED Strudwick type (3) 12q13.11-q13.2 
?Smith-Lemli-Opitz syndrome (2) 7q34-qter
Smith—Magenis syndrome (2) 17p11.2 
Somatotrophinoma (2) 11q13 
Somatotrophinoma (3) 20q13.2 


Sorsby fundus dystrophy, 136900 (3) 

Sorsby fundus dystrophy (2) 

Spastic paraplegia 2, 312920 (3) 

Spastic paraplegia, 312900 (3) 

Spastic paraplegia 5A (2) 

Spastic paraplegia-3 (2) 

Spastic paraplegia-4 (2) 

Spastic paraplegia-6 (2) 

Spherocytosis, hereditary (3) 

Spherocytosis, hereditary, Japanese type (3) 
Spherocytosis, recessive (3) 

Spherocytosis-1 (3) 

Spherocytosis-2 (3) 

Spinal and bulbar muscular atrophy of Kennedy, 313200 (3) 
Spinal muscular atrophy II (2) 

Spinal muscular atrophy III (2) 

Spinal muscular atrophy X-linked lethal infantile (2) 
Spinocerebellar ataxia, type 4 (2) 
Spinocerebellar ataxia, type 5 (2) 
Spinocerebellar ataxia-1 (3) 

Spinocerebellar ataxia-3 (2) 

Spinocerebellar atrophy II (2) 

Split hand/foot malformation, type 2 (2) 
Split-hand/split-foot malformation, type 1 (2) 
Spondyloepiphyseal dysplasia tarda (2) 
Stargardt disease 2 (2) 

Stargardt disease 3 (2) 

Stargardt macular dystrophy (2) 

Startle disease, autosomal recessive (3) 


Startle disease/hyperekplexia, autosomal dominant, 149400 (3)


Stickler syndrome, type 2 (2) 

Stickler syndrome, type I (3) 

Stickler syndrome, type II, 184840 (3) 
?Stiff skin syndrome (2) 
?Stomatocytosis I (1) 


22q12.1-q13.2 
22q13.1-qter 
Xq22 

Xq28 
8p12-q13 
14q 
2p24-p21 
15q11.1 
17q21-q22 
15q15 

1q21 
14q22-q23.2 
8p11.2 
Xq11-q12
5q12.2-q13.3 
5q12.2-q13.3 
Xp 

16q 
11p11-q11
6p23 
14q24.3-qter 
12q24 

Xq26 
7q21.2-q21.3 
Xp22.2-p22.1 
13q34 
6cen-q14
1p21-p13 
5q32 

5q32 
6p22-p21.3 
12q13.11-q13.2 
6p21.3 
Chr.15 
9q34.1 
Sucrose intolerance (1) 
Supravalvar aortic stenosis, 185500 (3) aos 
?{Susceptibility to IDDM} (1) ty 
{Susceptibility to measles} (1) _ 
Tay-Sachs disease (3) sles 
Thalassaemias, a- (3) ewe 
Thalassaemias, B- (3) u aes 
Thoracoabdominal syndrome (2) X 5 26.1 
Thrombocytopenia, neonatal alloimmune (1) i 1.32 
?Thrombocytopenia, Paris-Trousseau type (2) 11q23
Thrombocytopenia, X-linked, 313900 (3) Xp11.23-p11.22
?Thrombophilia due to elevated HRG (1) 3 28. 29 . 
Thrombophilia due to excessive plasminogen activator inhibitor (1) 7421 > 22 
Thrombophilia due to heparin cofactor II deficiency (3) 29¢ ul ; 
Thrombophilia due to protein C deficiency (3) 2 i3- 14 
Thrombophilia due to thrombomodulin defect (3) aaah 
Thromboxane synthase deficiency (2) 7q34 
Thymine-uraciluria (1) 1p22-q21 
Thyroid adenoma, hyperfunctioning (3) 14q31 
Thyroid hormone resistance, 274300, 188570 (3) 3p24.3 
Thyroid iodine peroxidase deficiency (1) 2p13 
Thyroid papillary carcinoma (1) 10q11-q12 
Thyrotropin-releasing hormone deficiency (1) Chr.3 
Torsion dystonia (2) 9q32-q34 
Torsion dystonia-Parkinsonism, Filipino type (2) Xql12-q13.1 
Total anomalous pulmonary venous return (2) 4p13-q12 
?Tourette syndrome (2) 18q22.1 
?Townes-Brocks syndrome (2) 16q12.1 
Transcobalamin II deficiency (3) 22q11.2-qter 
[Transcortin deficiency] (1) 14q32.1 
Treacher Collins mandibulofacial dysostosis (2) 5q32-q33.1
Trichorhinophalangeal syndrome, type I (2) 8q24.12 
Triphalangeal thumb-polysyndactyly syndrome (2) 7q36 
Trypsinogen deficiency (1) 7q32-qter 
{?Tuberculosis, susceptibility to} (2) 2q
Tuberous sclerosis-1 (2) 9q34 
Tuberous sclerosis-2 (2) 16p13.3 
Turcot syndrome with glioblastoma, 276300 (3) 3p21.3 
Turner syndrome (1) Xq13.1
Tylosis with oesophageal cancer (2) 17q23-qter 
Tyrosinaemia, type I (3) 15q23-q25 
Tyrosinaemia, type II (3) 16q22.1-q22.3 
Tyrosinaemia, type III (1) 12q14-qter 
Urate oxidase deficiency (1) 1p22 
Urolithiasis, 2,8-dihydroxyadenine (3) 16q24 
Usher syndrome, type 1A (2) 14q32 
Usher syndrome, type 1B (2) 11q13.5 
Usher syndrome, type 1C (2) 11p15.1 
Usher syndrome, type 2 (2) 1q32 
Usher syndrome, type 3 (2) 3q21-q25 
van der Woude syndrome (2) 1q32 
Velocardiofacial syndrome, 192430 (2) 22q11
Venous malformations, multiple cutaneous and mucosal (2) 9p 
Virilization, maternal and fetal, from placental aromatase deficiency (3) 15q21.1 
Vitreoretinopathy, exudative, familial (2) 11q13-q23 
Vitreoretinopathy, neovascular inflammatory (2) 11q13 

1q21-q22 


{Vivax malaria, susceptibility to} (1) 
von Hippel—Lindau syndrome (3) 3p26-p25 
von Willebrand disease (3) 12p13.3 
Waardenburg syndrome, type 2A, 193510 (3) 3p14.1-p12.3 
Waardenburg syndrome, type I (3) 2q35 
Waardenburg syndrome, type III, 148820 (3) 2q35 


Wagner syndrome, type II (3) 

Waisman Parkinsonism-mental retardation syndrome (2) 
?Walker—Warburg syndrome, 236670 (2) 

Watson syndrome, 193520 (3) 

Werdnig—Hoffmann disease (2) 

Werner syndrome (2) 

{Wernicke-Korsakoff syndrome, susceptibility to} (1) 
Wieacker—Wolff syndrome (2) 

Williams—Beuren syndrome, 194050 (3) 

Wilms’ tumour (3) 

Wilms’ tumour, type 2 (2) 

Wilson disease (3) 

Wiskott—Aldrich syndrome (3) 

?Wolf-Hirschhorn syndrome, 194190 (3) 
Wolf-Hirschhorn syndrome (2) 

Wolfram syndrome (2) 

Wolman disease (3) 

Woods neuroimmunologic syndrome (2) 

Wrinkly skin syndrome (2) 

Xanthinuria (1) 

?Xeroderma pigmentosum (1) 

Xeroderma pigmentosum, complementation group C (3) 
Xeroderma pigmentosum, group B (3) 

Xeroderma pigmentosum, group D, 278730 (3) 
Xeroderma pigmentosum, group G (3) 

Xeroderma pigmentosum, type A (3) 

?Xeroderma pigmentosum, type F (2) 

?XLA and isolated growth hormone deficiency, 307200 (3) 
Zellweger syndrome-1 (2) 

Zellweger syndrome-2 (3) 

Zellweger syndrome-3 (3) 


[ ], nondisease genes; { }, susceptibility genes. 
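The bracket conventions in this listing lend themselves to mechanical parsing. The following is a minimal Python sketch, not part of the original appendix, that splits one listing line into its parts. The function name `parse_entry` and the field names are hypothetical. The sketch only records the leading `?` and the parenthesized number; it does not interpret them, since their meanings are defined elsewhere in the appendix. It also assumes the entry name and location sit on one line, which is not true of every page of the listing.

```python
import re

# One listing line: name, a single-digit code in parentheses, optional location.
ENTRY = re.compile(r"^(?P<name>.+?)\s*\((?P<code>\d)\)\s*(?P<location>\S.*)?$")

# Bracket conventions stated in the footnote of the listing.
BRACKETS = (("[", "]", "nondisease"), ("{", "}", "susceptibility"))


def parse_entry(line):
    """Split a listing line into name, kind, '?' flag, code and location."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    name = match.group("name").strip()
    kind = "disorder"
    flagged = False
    # The '?' may appear before or inside the brackets, so check twice.
    if name.startswith("?"):
        flagged, name = True, name[1:].strip()
    for opener, closer, label in BRACKETS:
        if name.startswith(opener) and name.endswith(closer):
            kind, name = label, name[1:-1].strip()
            break
    if name.startswith("?"):
        flagged, name = True, name[1:].strip()
    return {
        "name": name,
        "kind": kind,
        "flagged": flagged,
        "code": int(match.group("code")),
        "location": match.group("location"),
    }


if __name__ == "__main__":
    print(parse_entry("{?Multiple sclerosis, susceptibility to} (2) 18q22-qter"))
    print(parse_entry("[Kininogen deficiency] (3) 3q26-qter"))
```

Lines that carry only a location, or only a name, return `None` or a `location` of `None` and would need to be paired up separately.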


12q13.11-q13.2 
Xq28 

9q31-q33 
17q11.2 
5q12.2-q13.3 
8p12-p11 
3p14.3 
Xq13-q21 
7q11.2

11p13 

i fey Ses) 
13q14.3-q21.1 
Xp11.23-p11.22 
4p16.1 


10q24-q25 
Xq26-qter 
2q32 
2p23-p22 
1q42 

3p25 

2q21 
19q13.2-q13.3 
13q33 
9q34.1
Chr.15 
Xq21.3-q22 
7q11.23 
1p22-p21 
8q21.1 
Appendix VIII Mouse gene knock-out tables

The ability to engineer specific mutations in mice by
mutating a specific gene, by homologous recombination 
in embryonic stem cells and then transferring the 
mutation into a developing mouse, represented a major 
breakthrough in mouse genetics (see e.g. refs 1-3). The 
most common point of this exercise has been to inactivate 
the targeted gene and to observe the phenotypic effects of 
the ‘knock-out’ on the developing mouse (e.g. refs 4-6). 

Table VIII.1 includes the majority of targeted mouse 
mutants for which the exact gene mutation has been 
characterized. In several cases a targeted mutation has 
produced a phenotype similar to a previously studied 
mouse mutant, and has thus facilitated the molecular 
characterization of the previously established mutant 
[7-13]. 

The first column of Table VIII.1 gives the name of the 
protein product of the targeted gene (with abbreviated 
and/or alternative names in parentheses), or a description 
of the targeted genomic locus if it is not a coding sequence. 
If the entry is not a null mutant, this is also indicated by, 
for example, ‘modification’ or ‘partial’ (a ‘leaky’ mutation). 
The table is alphabetized by the first column. 

In cases where more than one group has performed 
essentially the same mutation, the number of 
independent groups is indicated in the second column 
(e.g. x2, x3). Also noted in the second column is whether 
the mutant has been crossed with another mutant derived 
by gene targeting, and studied as a double mutant; this is 
indicated by a ‘D’, preceded by the number of different 
double mutants (e.g. D, 2D). The double mutants are 
described in Table VIII.2. 
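As an illustration of the second-column notation just described, here is a minimal Python sketch (not part of the original appendix) that decodes a cell such as `x2;D` into a number of independent groups and a number of double mutants. The function name `parse_second_column` is hypothetical, and treating an empty cell as "one group, no double mutants" is an assumption rather than something the text states.

```python
import re


def parse_second_column(cell):
    """Return (independent_groups, double_mutants) for a second-column cell."""
    groups, doubles = 1, 0  # assumed defaults for an empty cell
    for token in (part.strip() for part in cell.split(";")):
        if not token:
            continue
        group_match = re.fullmatch(r"x(\d+)", token)
        if group_match:
            groups = int(group_match.group(1))
            continue
        double_match = re.fullmatch(r"(\d*)D", token)
        if double_match:
            # A bare 'D' means one double mutant; '2D' means two.
            doubles = int(double_match.group(1) or 1)
            continue
        raise ValueError(f"unrecognized annotation: {token!r}")
    return groups, doubles


if __name__ == "__main__":
    for cell in ("x2", "D", "2D", "x2;D", ""):
        print(repr(cell), parse_second_column(cell))
```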

The third column gives a synopsis of the key aspects of 
the phenotype, generally those reported in the primary
reference(s) listed in the fourth column. When embryonic
lethality is associated with the mutation this is indicated 
by an ‘e’ followed by the day(s) of gestation when the 
mutants die. Perinatal lethality is defined as death within 
the first 24-48 h after birth, and neonatal lethality is death 
before weaning. In many cases the perinatal and neonatal 
lethality is highly variable and may depend on the genetic 
background or environment, and the reader is referred to 
the primary references for the precise details. 
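The 'e' notation for embryonic lethality can be read mechanically. The sketch below, in Python and not part of the original appendix, extracts the day or range of days from a phenotype synopsis such as "e10-20 lethal". The function name `embryonic_lethality_days` is hypothetical, and the pattern assumes the days are written as a single number or a hyphenated range directly after the 'e'.

```python
import re

# 'e' followed by a day of gestation, optionally a hyphenated range of days.
E_DAYS = re.compile(r"\be(\d+(?:\.\d+)?)(?:-(\d+(?:\.\d+)?))?\b")


def embryonic_lethality_days(phenotype):
    """Return (first_day, last_day) of embryonic lethality, or None if absent."""
    match = E_DAYS.search(phenotype)
    if match is None:
        return None
    first = float(match.group(1))
    last = float(match.group(2)) if match.group(2) else first
    return first, last


if __name__ == "__main__":
    print(embryonic_lethality_days("e10-20 lethal; heterozygotes protected"))
    print(embryonic_lethality_days("e13 lethal; neuronal apoptosis"))
    print(embryonic_lethality_days("Hypotension"))
```

Perinatal and neonatal lethality are described in words rather than with this notation, so they are not captured by the pattern.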

In a few instances, when more than one group has 
created essentially the same mutant, disparate results 
have been reported. Regardless of the reason for the 
disagreement, these findings are marked with an asterisk 
(*) in the third column. The fourth column gives the 
reference(s) that first described the mutant, while the fifth 
column lists subsequent reports describing the mutant, or 
utilizing the mutant as a tool for other experiments. 

In certain cases, unexpected roles for a protein have 
been discovered. In other cases, genetic redundancy or 
compensation allows partial function despite the absence 
of a protein thought to be crucial. Out of the 263 knock- 
outs listed in Table VIII.1, only about 25% lead to lethality 
before or just after birth, with another 10% resulting in 
death during the first 3-6 weeks. Most null mutants, 
however, survive into adulthood. Only a dozen or so of 
the mutants are apparently normal. 
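To put rough numbers on these proportions, the short Python sketch below (not part of the original appendix) converts the quoted percentages into approximate counts out of 263. Because the text gives the percentages only as "about 25%" and "another 10%", the results are estimates, and the function name `approximate_counts` is hypothetical.

```python
def approximate_counts(total, early_fraction, later_fraction):
    """Convert approximate fractions into rounded counts of knock-outs."""
    early = round(total * early_fraction)   # lethal before or just after birth
    later = round(total * later_fraction)   # death during the first 3-6 weeks
    surviving = total - early - later       # remainder, surviving into adulthood
    return early, later, surviving


if __name__ == "__main__":
    early, later, surviving = approximate_counts(263, 0.25, 0.10)
    print(f"lethal around birth: ~{early}")
    print(f"death at 3-6 weeks:  ~{later}")
    print(f"survive to adulthood: ~{surviving}")
```

On these figures roughly 66 knock-outs are lethal around birth, about 26 die within the first weeks, and about 171 (around 65%) survive into adulthood.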

The data for Tables VIII.1 and VIII.2 were originally 
compiled by E.P. Brandon, R.L. Idzerda and G.S. McKnight
and are reprinted here from Current Biology (5, 625-634; 
758-765; 873-881; [corrigenda] 1073) with permission. 

Listings of targeted mutations in mouse, pig, rat and 
Drosophila can be found in TBASE. WWW: 
http://www.gdb.org/Dan/tbase/tbase/.html 
Table VIII.1 Published targeted mutations.


Protein/locus x;D Phenotype Initial report(s) Follow-ups 
(see text) 
Ab1 x2 Perinatal lethality; multiple 14-16 iY 
developmental defects; 
lymphopenia 
Acetylcholine Loss of high-affinity nicotine 18 
receptor, nicotinic, binding in brain; abnormal 
B, subunit avoidance learning 
Acrosin No defects in fertilization 9 
Activin/inhibin BA D Perinatal lethal; whiskers and 20 
incisors absent; cleft palate 
Activin/inhibin BB D Eyelid defects; female repro- 21 
ductive defects 
Activin receptor Reduced FSH; reproductive defects 22 
type II (ActRcll) 
Adenomatous polyposis Postimplantation embryonic lethal; 23 
coli (APC) protein heterozygotes develop intestinal 
modification tumours 
Adenylyl cyclase type I Changes in long-term potentiation 24 
and spatial learning 
Adhesion molecule on Neonatal lethal; neural degeneration 25 
glia (AMOG) (B, subunit 
of Na,K-ATPase) 
Amyloid (B-) Increased incidence of corpus 26 
precursor protein callosum agenesis; behavioural 
modification deficits 
Angiotensinogen Hypotension 27 
Apolipoprotein AI Reduced HDL cholesterol 28 29 
Apolipoprotein B Hypobetalipoproteinaemia; exen- 30 
modification cephalus and hydrocephalus 
Apolipoprotein B e10-20 lethal; heterozygotes 31 
protected from diet-induced 
hypercholesterolaemia 
Apolipoprotein C-III Hypotriglyceridaemia 32 
Apolipoprotein E x2;D Hypercholesterolaemia and 33-35 36-43 
atherosclerosis 
Argininosuccinate Neonatal lethal; citrullinaemia; 44 
synthetase (ASS) hyperammonaemia 
Asialoglycoprotein Decreased HL-1 expression in liver 45 
receptor, minor 
subunit HL-2 
Atrial natriuretic peptide Salt-sensitive hypertension 46 
(ANP) 
B-cell lineage-specific Neonatal lethality; posterior 47 
activator protein (BSAP) midbrain morphological defects; 
(Pax5 gene) B-cell development disrupted 
B7 (CD28 ligand) Decreased co-stimulated response 48 
to alloantigen 
Bcl-2 x2 Neonatal lethality; lymphocytopenia; 49,50 
multiple growth defects; tremor; 
melanin synthesis defect; polycystic 
kidneys 
Bcl-x e13 lethal; neuronal and haemato- 51 
poietic apoptosis 
Bmi-1 Haematopoietic defects; ataxia; 52 
seizures; posterior transformation 
Brain-derived neuro- x2 Neonatal lethality; coordination 53, 54 
trophic factor (BDNF) deficiency; degeneration of 
sensory ganglia 


Cadherin (E-cadherin) SD e4-5 lethal; trophectoderm and 55,56 
blastocoel do not form 
Calcium-calmodulin- Deficient hippocampal long- 57,58 59, 60 
dependent protein term potentiation and long-term 
kinase II α depression; impaired spatial 
(α-CaMKII) learning; seizure prone; abnormal 
fear and pain responses 
Casein (β-casein) Reduced casein micelles; reduced 61 
protein in milk, reduced growth 
of pups 
CD2 No defects observed 62 63 
CD4 x2;D Decreased helper T-cell activity 64, 65 66-77 
CD8-a (Lyt-2) D Absence of cytotoxic T cells 78 74-77, 79-86 
CD8-β Reduced thymic maturation of 87 
CD8+ T cells 
CD18 partial Mild granulocytosis; impaired 88 
immune responses 
CD23 x3 Defects in IgE regulation and 89-91 
IgE-mediated signalling 
CD28 Decreased T-cell response to lectins; 92 93 
decreased IL-2Ra, IgG1, and 
IgG2b 
CD40 Defects in thymus-dependent 94 
humoral immunity 
CD40 ligand (CD40L) x2 Defects in thymus-dependent 95,96 
humoral immunity 
CD45 exon 6 Impaired T-cell maturation 97 98 
Cellular retinoic acid No defects observed 99 
binding protein 
(CRABP-I) 
Ciliary neurotrophic Motor neuron degeneration; 100 
factor (CNTF) muscle weakness 
Collagen α1 (IX) Non-inflammatory degenerative 101 
joint disease 
Collagen α2 (V) Neonatal lethality; abnormalities in 102 
modification spine, skin and eyes 
Collagen (X) No defects observed 103 
Connexin 43 Perinatal lethal; cardiac 104 
malformation 
Corticotropin-releasing Decreased adrenal corticosterone 105 
hormone (CRH) release in response to stress; 
offspring of homozygous mother 
perinatal lethal due to lung 
dysplasia 
Creatine kinase No burst activity in skeletal muscle 106 107, 108 
Csk x2 e10 lethal; no notochord; increased 109,110 
Src and Fyn activity 
Cyclic AMP-responsive Lack late phase of CA1 long-term 111 112 
element binding protein potentiation; decreased long- 
(CREB) α and δ isoforms term memory; increase in CREM 
Cystathionine β-synthase Neonatal lethality; growth 113 
(CBS) retardation; abnormal hepatic 
morphology 
Cystic fibrosis trans- x3 Neonatal lethality; meconium ileus; 114-117 118-124 
membrane regulator defective epithelial chloride 
(CFTR) transport 


Cystic fibrosis trans- Neonatal lethality; meconium ileus; 125 126-128 
membrane regulator defective epithelial chloride 
(CFTR) partial transport 
Cytochrome b, phagocyte- Increased susceptibility to 129 
specific oxidase, 91 kD pathogens; model for X-linked 
subunit chronic granulomatous disease 
DNA methyltransferase e10-12 lethal; defective expression 130 131 
of imprinted genes 
DNA polymerase β Demonstrates feasibility of tissue- 132 
modification specific disruption using 
Cre-loxP system 
Dopamine D1 receptor x2 Premature lethality without wet 133, 134 135 
(D1R) food; growth retardation; 
reduced dynorphin and 
substance P expression; hyper- 
activity; decreased rearing 
behaviour 
E2A x2 Neonatal lethality; growth 136,137 
retardation; lack B cells 
En-1 Perinatal lethal; multiple 138 
developmental defects 
En-2 Abnormal cerebellar foliation 139 140 
Endothelin-1 Perinatal lethal; craniofacial 141 
abnormalities; high blood 
pressure in heterozygotes 
Endothelin-3 Aganglionic megacolon; white 12 
spotting of skin and coat; allelic 
with lethal spotting (ls) 
Endothelin-B receptor Aganglionic megacolon; white 13 
spotting of skin and coat; 
allelic with piebald (s) 
Evx1 (even-skipped Early postimplantation lethal 142 
homologue) 
Excision repair cross- Neonatal lethal; liver failure and 143 
complementing protein aneuploidy 
(ERCC-1) 
Fc receptor γ subunit Pleiotropic effector cell defects 144 
Fer D No defects observed 145 
Fibroblast growth Tail and inner ear develop- 146 
factor 3 (FGF3) mental defects 
(int-2) 
Fibroblast growth e4-6 lethal; inner cell mass 147 
factor 4 (FGF4) does not develop 
Fibroblast growth Long hair; allelic with angora (go) 11 
factor 5 (FGF5) 
Fibroblast growth x2 e7-9 lethal; abnormal mesoderm 148, 149 
factor receptor 1 patterning 
(FGFR-1) 
Fibronectin e9-10 lethal; defects in mesoderm 150 
development 
FMR1 protein Macroorchidism; hyperactivity 151 
Follistatin Perinatal lethal; multiple 152 
developmental defects 
Fos x2 Perinatal lethality; osteopetrosis; 153, 154 155-160 
defects in gametogenesis and 
haematopoiesis 
Fumarylacetoacetate Perinatal lethal; hepatic 10 
(FAH) dysfunction; allelic with lethal 
albino (alf/hsdr-1) 
Fyn (p59) x2; 2D Signalling defect in thymocytes 17, 161, 162 163-167 
but not peripheral T cells; 
impaired long-term potentiation; 
abnormal olfactory glomeruli and 
hippocampal morphology; 
suckling defect 
Fyn (p59)67 Signalling defective in thymocytes 168 
but not peripheral T cells 
GATA-2 e10-11 lethal; severe anaemia 169 
Glial fibrillary acidic No defects observed 170 
protein (GFAP) 
Globin (β-globin b1) Premature lethality; anaemia 171 
Glucocerebrosidase Perinatal lethal; lysosomal storage 172 
defect; model for Gaucher’s 
disease 
Glutamate receptor, x2 Ataxia; impaired cerebellar long- 173-175 
metabotropic type 1 term depression and conditioned 
(mGluR1) eyeblink response; impaired 
long-term potentiation, spatial 
learning and context-dependent 
fear conditioning 
Glutamate receptor, x2 Perinatal lethal; respiratory failure 176,177 
NMDA type 1 
(NMDAR1) 
Glutamate receptor, Reduced CA1 long-term 178 
NMDA type ε1 potentiation; spatial learning 
(NMDAR ε1) (NR2A) defect 
Granulocyte colony- Granulopoietic defects 179 
stimulating factor 
(G-CSF) 
Granulocyte-macrophage x2 Pulmonary pathology; apparently 180,181 182 
colony-stimulating factor normal haematopoiesis 
(GM-CSF) 
Granzyme B Cytotoxic T-lymphocyte defect 183 
Growth-associated Perinatal and neonatal lethality; 184 
protein-43 (GAP-43) abnormal pathfinding at 
the optic chiasm 
Hck D Phagocytosis impaired; 145 
increased Lyn activity 
Hepatic lipase Mild dyslipidaemia 185 
Hepatocyte growth x2 e13-16 lethal; placental defect; 186, 187 
factor/scatter factor small liver 
(HGF/SF) 
Hepatocyte nuclear x2 e10-11 lethal; disorganized node 188, 189 
factor 3β (HNF-3β) and notochord 
Hepatocyte nuclear e6 lethal; ectodermal cell death; 190 
factor 4 (HNF-4) impaired gastrulation 
Hexosaminidase Accumulation of ganglioside in 191 
(β-hexosaminidase central nervous system; model 
α subunit) for Tay-Sachs disease 
Hox 11 No spleen 192 
Hox-A1 (Hox 1.6) x2 Perinatal lethal; hindbrain 193, 194 195-197 
reorganization; cranial nerve 
and inner ear defects 
Hox-A2 (Hox 1.11) x2 Perinatal lethal; homeotic transfor- 198, 199 
mation of rostral head 
Hox-A3 (Hox 1.5) D Perinatal lethal; athymic; aparathy- 200 
roid; throat, heart, arterial, and 
craniofacial abnormalities 
Hox-A4 (Hox 1.4) Rib and sternal defects 201 
Hox-A5 (Hox 1.3) Perinatal lethality; cervical and 202 
thoracic homeotic trans- 
formations 
Hox-A11 (Hox 1.9) Homeotic transformations; skeletal 203 
malformations 
Hox-B4 (Hox 2.6) Neonatal lethality; cervical 204 
homeotic transformation; sternal 
defects 
Hox-B4 (Hox 2.6) Cervical homeotic transformation 204 
truncation 
Hox-B5 (Hox 2.1) Rostral shift in shoulder girdle; 205 
homeotic transformation of 
vertebrae C6 through T1 
Hox-B6 (Hox 2.2) Missing first rib; bifid second rib; 205 
homeotic transformation of 
vertebrae C6 through T1 
Hox-C8 (Hox 3.1) Neonatal lethality; skeletal 206 
transformations 
Hox-D3 (Hox 4.1) D Transformations of anterior 207 
vertebrae (atlas and axis) 
Hox-D11 (Hox 4.6) x2 Vertebral homeotic trans- 208, 209 
formations; other skeletal 
abnormalities 
Hox-D13 (Hox 4.8) Skeletal alterations along all body 210 
axes; males infertile 
Hypoxanthine-guanine Demonstrates germline trans- 211 
phosphoribosyl- mission of a genetic correction 
transferase (HPRT) introduced by homologous 
correction recombination in embryonic 
stem cells 
Ik (Ikaros gene products) Neonatal lethality; reduced size; 212 
lymphocytes and lymphoid 
progenitors absent 
Immunoglobulin D x2 Reduced number of mature B cells* 213,214 
Immunoglobulin E No defects observed 215 
Immunoglobulin E Resistant to cutaneous and 216 
receptor a chain systemic anaphylaxis 
Immunoglobulin κ No Igκ rearrangement; slight 217 218 
intron enhancer reduction in splenic B cells 
Immunoglobulin κ x2 Reduced number of B cells 218, 219 220 
light chain 
Immunoglobulin κ B cells produce human-mouse 221 
replaced with human chimeric κ-bearing antibodies 
constant region 
Immunoglobulin Absence of B cells 222 223-225 
membrane exon 
Inhibin α D Gonadal tumours; both males 226 227, 228 
and females sterile 
Insulin receptor x2 Reduced size; impaired glucose 229, 230 
substrate-1 (IRS-1) tolerance; decrease in insulin-, 
IGF-1- and IGF-2-induced 
glucose uptake 
Insulin-like x2; 2D Perinatal lethality (background 230-233 
growth factor I (IGF-I) strain dependent); 60% normal 
birthweight; infertile; under- 
developed muscle tissue; lung 
defects 
Insulin-like growth 2D 60% normal birthweight; 234 231, 232, 
factor II (IGF-II) (heterozygotes with paternally 235-237 
inherited null allele similar to 
homozygotes; gene imprinting 
implicated) 
Insulin-like growth 2D Perinatal lethal; organ hypoplasia; 231, 232 238-240 
factor receptor 1 (IGF1R) respiratory failure; 
45% normal birthweight 
Insulin-like growth x2 Perinatal lethal; 30% larger at birth; 241, 242 
factor receptor 2 organ and skeletal abnormalities; 
(IGF2R) (mannose (heterozygotes with maternally 
6-phosphate receptor inherited null allele similar to 
300; MPR300) homozygotes; gene imprinting 
implicated) 
Insulin-promoter- Neonatal lethal; no pancreas 243 
factor-1 (IPF-1) 
Integrin (α5 integrin) e10-11 lethal; mesodermal defects 244 
Intercellular adhesion x2 Leucocytosis; impaired inflam- 245, 246 
molecule-1 (ICAM-1) matory and immune responses 
Interferon α/β receptor Antiviral defence impaired 247 
Interferon γ Multiple immune response defects 248 249-252 
Interferon γ receptor Multiple immune response defects 253 254-256 
Interferon regulatory x2 Decreased CD4−8+ T cells; impaired 257, 258 259-261 
factor 1 (IRF-1) interferon γ response 
Interferon regulatory Premature lethality; defects in 257 
factor 2 (IRF-2) haematopoiesis; immuno- 
compromised 
Interleukin-1β-converting Decreased IL-1 production; 262 
enzyme (ICE) resistance to endotoxic shock 
Interleukin-2 (IL-2) D Premature lethality; normal 263 264, 265 
T-cell subset composition, but 
dysregulated immune system, 
inflammatory bowel disease 
Interleukin-2 receptor Lymphopenia; absence of NK cells 266 
γ chain (IL-2Rγ) 
Interleukin-4 (IL-4) x2; D CD4+ (Th2)-produced cytokines 267, 268 269, 270 
reduced; serum IgG1 and 
IgE reduced 
Interleukin-6 (IL-6) x2 Higher bone turnover rate; no bone 271,272 273-275 
loss when ovariectomized; 
immune defects; reduced 
IgA-producing cells 
Interleukin-7 receptor Early lymphocyte expansion 276 
(IL-7R) severely impaired 
Interleukin-8 receptor Lymphadenopathy and 277 
(IL-8R) splenomegaly; increased B cells 
and neutrophils 
Interleukin-10 (IL-10) Reduced growth; anaemia; 278 
chronic enterocolitis 
Invariant chain (Ii) x2 MHC class II transport and 279, 280 281, 282 
function defective; reduced 
CD4+ T cells 
JH-Eμ immunoglobulin Suppression of switch recombi- 283 223, 284 
heavy chain (joining nation at μ gene; absence 
and enhancer regions) of B cells 
JH immunoglobulin x2 Absence of B cells 285, 286 
joining region 
JH replaced with Rearranged V transgene 287 
rearranged V region expressed in all B cells 
Jun x2 e11-16 lethal; impaired hepato- 288, 289 
genesis*; defective fetal liver 
erythropoiesis*; oedema; 
decreased growth of embryonic 
fibroblasts 
Keratin 8 e12 lethal; internal bleeding; 290 291 
abnormal fetal liver; a few mice 
survive to adulthood 
Krox-20 x2 Perinatal lethal; defects in hind- 292,293 294 
brain and associated cranial 
sensory ganglia 
L14 s-type lectin No defects observed 295 
Lactalbumin Females can’t nurse offspring; 296 
(α-lactalbumin) milk viscous; no lactose 
Lactalbumin (α- Demonstrates germline trans- 297 
lactalbumin replaced by mission of ES cells that have 
human lactalbumin) undergone double-replacement 
targeting 
λ5 Defective B cell development 298 223 
Laminin (s-laminin/ Neonatal lethal; proteinuria; 299 
laminin β2) neuromuscular junction defects 
Lck (p56lck) Thymic atrophy; reduced CD4+8+ 300 301, 302 
T cells; very few mature 
T cells; immunocompromised 
Leukaemia inhibitory x2 Decreased haematopoietic stem 303, 304 305 
factor (LIF) cells; deficient neurotransmitter 
switch in vitro but normal 
sympathetic neurons in vivo; 
blastocysts do not implant in 
homozygous mother 
Lipoxygenase x2 Resistance to certain inflam- 306, 307 
(5-lipoxygenase) matory agents 
LMP-7 Defects in MHC class I expression 308 
and antigen presentation 
Low density lipoprotein D Hypercholesterolaemia; 309 310, 311 
receptor (LDLR) increased apoB-100 
Low-density lipoprotein Embryonic lethal; failed 312 313 
receptor-related protein implantation of embryos 
(LRP) 
Lymphoid enhancer Neonatal lethal; defects in 314 
factor 1 (LEF-1) development of multiple 
organs 
Major histocompatibility Decreased CD4+8− T cells; 315 
complex class II Aα immune defects 
(MHC II Aα) 
Major histocompatibility x2; D Decreased CD4+8− T cells; deficient 316, 317 318-328 
complex class II Aβ cell-mediated immunity; some 
(MHC II Aβ) B-cell dysfunctions; inflam- 
matory bowel disease 
Mammalian achaete Neonatal lethal; olfactory and 329 
scute homologue 1 autonomic neuron deficiency 
(Mash-1) 
Mammalian achaete scute e10 lethal; placental defects; 330 331 
homologue 2 (Mash-2) (heterozygotes with paternally 
inherited null allele similar to 
homozygotes; gene imprinting 
implicated) 
Mannose 6-phosphate x2 Defects in targeting/retention 332, 333 
receptor 46 (MPR46) of lysosomal enzymes 
(cation-dependent 
mannose 6-phosphate 
receptor; CD-MPR) 
Metallothionein I (MT-I) x2 Sensitive to heavy metal 334, 335 336 
and metallothionein II 
(MT-II) 
Microglobulin SAID) Decreased CD4−8+ T cells 337-339 318, 323-328, 
(β2-microglobulin) 340-368 
Mos SOD Reduced female fertility; 369, 370 
parthenogenesis 
Mpl Thrombocytopenia; increased 371 
serum thrombopoietin 
Msx-1 Cleft palate; tooth and craniofacial 372 
abnormalities 
Mullerian-inhibiting Males infertile due to development 373 
substance (MIS) of female reproductive organs; 
Leydig cell hyperplasia 
Multiple drug resistance Blood-brain barrier defect; drug 374 
protein 1a (mdr1a) sensitivity 
Multiple drug resistance Liver disease; lack of phos- 375 376 
protein 2 (mdr2) pholipid secretion into bile 
Myb e15 lethal; defect in haematopoiesis 377 
Myc (c-myc) e10 lethal; heart and neural tube 378 
abnormal 
Myc (N-myc) x3 e10-12 lethal; development of 379-381 382, 383 
several organs affected 
Myc (N-myc) partial Perinatal lethal; lung defect 384 383 
Myelin-associated x2 Oligodendrocyte abnormalities; 385, 386 
glycoprotein (MAG) subtle tremor; increased NCAM 
expression; 
Myf-5 D Neonatal lethal; incomplete rib 387 388 
development 
Myf-5 β-galactosidase Used to study Myf-5 expression 389 390, 391 
insertion during early development 
MyoD D No obvious defects but reduced 392 
survival; increased Myf-5 
Myogenin x2 Perinatal lethal; decreased skeletal 393, 394 
muscle 
Myristoylated alanine- Perinatal lethal; defects in brain 395 
rich C-kinase substrate development 
(MARCKS) 
N-acetylglucosaminyl- x2 e10 lethal; defects in neural tube 396, 397 
transferase I (GlcNAc-TI) formation, vascularization, and 
determination of left-right 
asymmetry 
NAD+:protein (ADP- Susceptible to skin disease 398 
ribosyl) transferase 
(ADPRT) 
Nerve growth factor Neonatal lethal; sensory and 399 
(NGF) sympathetic neurons drastically 
reduced; forebrain ACh neurons 
still present 
Nerve growth factor Decreased sensory innervation; 400 401, 402 
receptor (NGFR) secondary infections and 
(low affinity, p75) ulceration 
Neural cell adhesion Small olfactory bulb; deficient 403 
molecule (NCAM) spatial learning 
Neural cell adhesion Defective migration of olfactory 404 
molecule-180 neurons; small olfactory bulb 
(NCAM-180) 
Neurofibromatosis e11-13 lethal; malformation of 405 406 
type 1 gene (NF1) heart; hyperplasia of sym- 
pathetic ganglia; heterozygotes 
predisposed to tumours 
Neurotrophin-3 (NT-3) x3 Neonatal lethal; peripheral sensory 407-409 
and sympathetic neurons 
reduced; limb proprioceptive 
afferents absent 
NF-IL6 Defects in macrophage bactericidal 410 
and tumoricidal activities 
NF-κB p50 subunit Multifocal defects in immune 411 
responses 
Nitric oxide synthetase, Stomach hypertrophy; normal CA1 412 413, 414 
neuronal (nNOS) long-term potentiation 
Notch 1 e10-11 lethal; widespread cell death 415 
Oct-2 Perinatal lethal; decreased IgM+ 416 
B cells 
Oestrogen receptor Females infertile; males have 417 
reduced fertility 
P0 Hypomyelination; myelin degen- 418 
eration; tremors 
p53 x3; 2D Spontaneous tumours; thymocytes 419-421 422-438 
resistant to apoptosis by radiation 
or etoposide 
Parathyroid hormone- Perinatal lethal; abnormal chon- 439 
related peptide (PTHrP) drocyte and bone development 
Perforin x3 Impaired CTL and NK cell function; 440-442 443 
unable to clear LCMV infection 
Pim-1 Impaired response of early B cells to 444 445-447 
interleukin-7 and steel factor; 
impaired response of bone 
marrow-derived mast cells 
to interleukin-3 
Plasminogen activator Mildly hyperfibrinolytic; 448, 449 
inhibitor-1 more resistant to thrombosis 
Platelet-derived growth Perinatal lethal; kidney defect; 450 
factor B (PDGF B) haemorrhagic; erythroblastosis; 
macrocytic anaemia; throm- 
bocytopenia 
Platelet-derived growth Perinatal lethal; kidney defect; 451 
factor receptor β (PDGFβR) haemorrhagic; anaemic; throm- 
bocytopenia 
Prion protein (PrP) Resistant to scrapie; weakened 452 453-459 
GABA(A) receptor-mediated fast in- 
hibition; impaired long-term 
potentiation 
Protein kinase Cγ (PKCγ) Deficient long-term potentiation; 460, 461 
impaired context-dependent 
fear conditioning 
Proteolipid protein Disrupted myelination; decreased 462 
(PLP/DM20) axonal conduction velocity; 
behavioural changes 
PU.1 e16-18 lethal; defect in develop- 463 
ment of lymphoid and myeloid 
cells 
Rab3A Induced synaptic depression 
increased 464 
Ras (N-Ras) No defects observed 465 
Rbtn2 e10 lethal; absence of erythrocytes 466 
Recombination activation x2 Absence of mature B and 467, 468 322, 469-473 
gene 1 (RAG-1) T lymphocytes 
Recombination activation Absence of mature B and 474 475-479 
gene 2 (RAG-2) T lymphocytes 
RelB SP Multiorgan inflammation; 480, 481 
haematopoietic defects 
Ret Neonatal lethal; kidney and 482 
enteric nervous system 
defects 
Retinoblastoma-1 (Rb-1) x2; D e12-15 lethal; neural and 483, 484 436 
haematopoietic defects; 
pituitary tumours in 
heterozygotes 
Retinoblastoma-1 (Rb-1) e12-15 lethal; neural and 485 486, 487 
truncation haematopoietic defects 
Retinoic acid receptor α 61D) Neonatal lethal; testicular 488 
(RARα) degeneration 
Retinoic acid receptor α1 x2;2D No defects observed 488, 489 
(RARα1) 
Retinoic acid receptor β2 3D No defects observed 490 
(RARβ2) 
Retinoic acid receptor γ 4D Neonatal lethality; growth 491 
(RARγ) deficiency; glandular, skeletal, 
and cartilage defects 
Retinoic acid receptor γ2 No defects observed 492 
(RARγ2) 
RXRα x2; 2D e13-16 lethal; eye, heart, and 492, 493 
liver defects 
Ryanodine receptor Perinatal lethal; skeletal muscle 494 
defects; excitation—contraction 
uncoupled 
Selectin (L-selectin) Defects in lymphocyte homing 495 
and leucocyte rolling 
and migration 
Selectin (P-selectin) Defects in leucocyte behaviour; 496 
increased neutrophils 
Serotonin receptor 1B Aggressive behaviour 497 
(5-HT 1B receptor) 
SF-1 (Ftz-F1) Neonatal lethal; lack adrenals; 498 499 
both genders have female 
internal genitalia only 
sγ1 class switch region Shutdown IgM-IgG class switch 500 
at that allele 
Src 2D Osteopetrosis 501 17, 165, 166, 
502-507 
Srm No defects observed 508 
Synapsin I Increased paired pulse facilitation 509 
Synaptotagmin I Perinatal lethal; synaptic trans- 510 
mission severely impaired in 
cultured neurons 
T-cell factor-1 (TCF-1) Defect in thymocyte development 511 
T-cell receptor α (TCRα) x2;D Loss of thymic medullae; devoid 471, 512 323, 473, 
of single positive thymocytes; 513-516 
no αβ T cells; inflammatory 
bowel disease 
T-cell receptor β (TCRβ) 2D Reduced % CD4+8+, and total 471 322, 469, 513, 
number of thymocytes; 514 
inflammatory bowel disease 
T-cell receptor δ (TCRδ) D Absence of γδ T cells 517 513, 514 
T-cell receptor η (TCRη) Neonatal lethal; (partial knockout 518 
of Oct-1 on opposite strand) 
T-cell receptor no (TCR) Lower birth rate; T cells develop 519 
normally; (partial knockout of 
Oct-1 on opposite strand) 
T-cell receptor ζ (TCRζ) x3 Decreased CD4+8+ thymocytes and 520-522 523 
single positive T cells; low 
TCR expression 
T-cell receptor ζη (TCRζη) Decreased CD4+8+ thymocytes and 524 
single positive T cells; low 
TCR expression 
Tal-1 (SCL) e9-10 lethal; haematopoietic defect 525 
Tau Altered microtubules in small 526 
calibre axons 
Tek receptor tyrosine e8-9 lethal; decreased endothelial 527 
kinase (Tek RTK) cells; cardiac defects 
Tenascin-C No defects observed 528 529 
Terminal deoxynucleo- Decreased TCR diversity 530 
tidyltransferase (TdT) 
Thrombomodulin e8-9 lethal; growth retardation 531 
Tissue plasminogen D Impaired clot lysis 532 
activator (tPA) 
Transforming growth x2 Hair follicle and eye defects; 8,9 
factor α (TGFα) allelic with waved-1 (wa-1) 
Transforming growth x2 Neonatal lethal; multifocal 533, 534 535-540 
factor β1 (TGFβ1) inflammatory disease 
Transporter associated MHC class I transport and function 541 542-546 
with antigen processing 1 defective; lack CD4-8+) 
(TAP1) 
Transthyretin Decreased serum retinol, 547 548, 549 
retinol-binding protein 
and thyroid hormone 
TrkA Neonatal lethal; severe sensory 550 
and sympathetic neuropathies; 
decreased forebrain ACh neurons 
TrkB Neonatal lethal; deficiencies in 551 
central and peripheral nervous 
system 
TrkC No Ia muscle afferents; propri- 552 
oception disrupted 
Tumour necrosis factor x2 Resistant to endotoxic shock; 553, 554 555 
receptor 1 (TNF-R-1) susceptible to Listeria infection 
(p55) 
Tumour necrosis factor Resistance to TNF-induced 556 
receptor 2 (TNF-R-2) necrosis and death 
(p75) 
Tumour necrosis factor-β No Peyer’s patches or lymph 557 
(TNF-β) (lymphotoxin) nodes; increased IgM+ B cells 
Urate oxidase Neonatal lethality; hyperuricaemia; 558 
urate nephropathy 
Urokinase plasminogen D Occasional fibrin deposition 532 
activator (uPA) 
Vascular cell adhesion e8-10 lethality; chorioallantoic 559 
molecule-1 (VCAM-1) fusion disrupted; surviving adults 
have elevated mononuclear 
leucocytes 
Vav e4-7 lethal; possible trophoblast 560 
defect 
Vimentin No defects observed 561 
Wilms’ tumour protein 1 e11 lethal; kidney apoptosis; 562 
(WT-1) gonadal, lung, and heart defects 
Wnt-1 (int-1) Neonatal lethality; cerebellum and 563, 564 565, 566 
midbrain absent; severe ataxia; 
allelic with swaying (sw) 


Wnt-3a e10 lethal; no hind portion 567 
Wnt-4 Perinatal lethal; kidney defects 568 
Wnt-7a Limb abnormalities; sterile 569 
Yes 2D No defects observed 570 17, 165, 166, 
496 


Table VIII.2 Double knockouts. 


Mutants crossed Phenotype Initial report(s) Follow-up(s) 
Activin/inhibin βA & activin/ Perinatal lethal; whiskers 20 
inhibin βB and incisors absent; cleft 
palate; eyelid defects 
Apolipoprotein E (apoE) and low- ApoB-48 and apoB-100 both 311 
density lipoprotein receptor (LDLR) elevated 
CD4 & CD8 Some cytotoxic T cells are 571 572 
still present 
Fer & Hck Increased susceptibility to 145 
Listeria infection 
Fyn & Src Neonatal lethal; reduced size; 567 
osteopetrosis 
Fyn & Yes Premature lethality; degen- 567 
erative renal changes 
Hox-A3 (Hox 1.5) and Hox-D3 Synergistic defects; atlas 572 
(Hox 4.1) deleted 
Inhibin (a) and p53 Gonadal tumours 574 
Insulin-like growth factor I 30% normal birthweight 231,232 
(IGF-I) and insulin-like growth 
factor II (IGF-II) 
Insulin-like growth factor Appear identical to IGF1R 231, 232 
receptor 1 (IGF1R) and insulin-like knockout 
growth factor I (IGF-I) 
Insulin-like growth factor Perinatal lethal; 30% normal 231, 232 
receptor 1 (IGF1R) and insulin-like birthweight 
growth factor II (IGF-II) 
Interleukin-2 (IL-2) and Paradoxical increase in T-cell 575 
interleukin-4 (IL-4) proliferation 
Microglobulin (β2-microglobulin) Depleted of CD4+8− and 328, 576 318, 577 
and major histocompatibility CD4−8+ T cells 
complex II (MHC II) 
MyoD1 & Myf-5 Neonatal lethal; no skeletal 578 
muscle 
p53 & retinoblastoma-1 (Rb-1) Decrease in the ectopic apoptosis 436 
in lens fibre cells that is observed 
in Rb-1 single mutants 
RARα & RARβ2 Skeletal malformations; homeotic 579, 580 
transformations; middle ear 
ossicle fusions; eye, oesophago- 
tracheal, thymus, thyroid, 
parathyroid, heart, and 
urogenital defects 
RARα & RARγ e13-16 lethality; reduced size; 579, 580 
exencephaly; middle ear ossicle 
fusions; eye, thymus, thyroid, 
parathyroid, heart, umbilical, 
gland, and urogenital defects; 
craniofacial and skeletal 
malformations; homeotic 
transformations 
RARy & RXRo e13-16 lethal; eye and heart defects 493 
RARα1 & RARβ2 Middle ear ossicle fusions; eye, 579, 580 
oesophagotracheal, thymus, 
thyroid, parathyroid, heart, 
and urogenital defects 
RARα1 & RARγ Skeletal malformations; homeotic 579, 580 
transformation; gland defects 
RARβ2 & RARγ Skeletal and cartilage malfor- 579, 580 
mations; homeotic transfor- 
mations; eye, thyroid, gland, 
and urogenital defects 
RARγ & RXRα e13-16 lethal; eye and heart defects 493 
Src & Yes Mostly neonatal lethal; reduced 570 
size; osteopetrosis 
T-cell receptor α (TCRα) and T-cell Reduced % CD4+8+ cells, and 471 
receptor β (TCRβ) total number of thymocytes 
T-cell receptor β (TCRβ) and Devoid of CD4+8+ cells and single 471 322, 573 
T-cell receptor δ (TCRδ) positive thymocytes; inflammatory 
bowel disease 
Tissue plasminogen activator (tPA) Extensive fibrin deposition 532 
and urokinase plasminogen 
activator (uPA) 
Appendix IX Chromosome aberrations 
associated with cancer 


Tables IX.1-IX.10 refer to haematological malignancies 
(see Chapter 7); Tables IX.11 and IX.12 refer to solid 
tumours (see Chapter 8). 


Table IX.1 Chromosome abnormalities in acute myeloid 
leukaemia (AML) 

Table IX.2 French-American-British classification of 
acute myeloid leukaemia 

Table IX.3 Chromosome changes in acute lymphoid 
leukaemia (ALL) 

Table IX.4 French-American-British classification of 
acute lymphoid leukaemia 

Table IX.5 Common chromosome changes in 
myelodysplastic syndromes (MDS) excluding chronic 
myelomonocytic leukaemia (CMML) 

Table IX.6 French-American-British classification of 
myelodysplastic syndromes excluding chronic 
myelomonocytic leukaemia 

Table IX.7 Common chromosome changes in 
myeloproliferative disorders (MPD) 

Table IX.8 Classification of myeloproliferative disorders 
Table IX.9 Chromosome abnormalities in lymphomas 
Table IX.10 Chromosome changes in chronic 
lymphoproliferative disorders 

Table IX.11 Consistent chromosome rearrangements in 
solid tumours 

Table IX.12 Gene amplifications associated with solid 
tumours 


Table IX.1 Chromosome abnormalities in acute myeloid leukaemia (AML). 


Chromosomal abnormality Association with disease 

der(1)t(1;7)(p11;p11) Secondary AML mostly M4* 


ins(3;3)(q26;q21q26) 

inv(3)(q21q26) 

t(3;3)(q21;q26) Abnormal megakaryocytes and thrombocytosis 
trisomy 4 M2 and M4 or M5 

monosomy 5 Secondary AML 

del(5q) Secondary AML 

t(6;9)(p23;q34) M2 and M4 (see Fig. 7.6 in Chapter 7) 
monosomy 7 Secondary AML 

del(7q) Secondary AML 

trisomy 8 Myeloid disease 

t(8;21)(q22;q22) M2 with Auer rods, eosinophilia 
Table IX.1 Continued. 



Association with disease 


Chromosomal abnormality 


t(9;11)(p21;q23) 
t(9;22)(q34;q11) 
t(10;11)(p14;q13) 
del/t(11q) 
del/t(12p) 
t(15;17)(q22;q21) 
t(16;16)(p13;q22) 
inv(16)(p13q22) 
del(16)(q22q24) 
del(20q) 


missing Y 


M5 mostly M5a 


M1 and M2† 
M4 and M5 


M4, M5, mostly M5a 
Secondary AML 
M3 and M3v (see Fig. 7.8 in Chapter 7) 


M4Eo 
M6 


Age-related phenomenon 


*M1, M2, etc. are types of acute myeloid leukaemia; see Table IX.2. 
†This translocation is mostly associated with CML/CGL but is also found in AML and ALL. 




Classification Description 

M1 Myeloblastic without maturation 

M2 Myeloblastic with maturation 

M3 Promyelocytic (hypergranular) 

M3 variant Promyelocytic (hypo- or microgranular) 
M4 Myelomonocytic 

M4Eo M4 with eosinophilia 

M5a Monoblastic 

M5b Promonocytic-monocytic 

M6 Erythroblastic (<30% blasts if >50% erythrocytes) 
M7 Megakaryoblastic 

M0 Myeloblastic with minimal differentiation 


Chromosomal abnormality Association (if any) 


t(8;14)(q24;q32) L3* poor prognosis 
t(8;22)(q24;q11) L3 poor prognosis 
t(2;8)(p12;q24) L3 poor prognosis 
duplications of 1q 

t(1;19)(q23;p13) Pre B-cell ALL L1 

t(1;11)(p32;q23) Pre B-cell ALL L1 
t(1;14)(p32;q11) T-lineage ALL 
t(8;14)(q24;q11) T-lineage ALL 
t(10;14)(q24;q11) T-lineage ALL 

t(11;14)(p13;q11) T-lineage ALL 
t(7;14)(q35;q11) T-lineage ALL 
inv(14)(q11q32) and t(14;14)(q11;q32) Adult T-cell leukaemia 
i(6p) 

del(6q) Common ALL, T- or B-lineage, 
intermediate prognosis 


Continued. 


Table IX.2 French-American-British classification of acute 
myeloid leukaemia. 


Table IX.3 Chromosome 
changes in acute lymphoid 
leukaemia (ALL). 
Table IX.3 Continued. 


Table IX.4 French-American-British classification of acute 
lymphoid leukaemia. 


Chromosomal abnormality Association (if any) 


dic(7;9)(p11;p11) 

i(7q) 

del(9)(p21) ALL L1 and L2, T- or B-lineage 

t(9;22)(q34;q11) Poor prognosis, immature 
B-cell ALL 

t(4;11)(q21;q23) Poor prognosis, immature 
B-cell ALL 

t(11;19)(q23;p13) Biphenotypic acute leukaemia 

t(9;11)(p22;q23) Biphenotypic acute leukaemia 

t(11;17)(q23;p13) Biphenotypic acute leukaemia 

del(12)(p11p13) Common ALL L1 or L2 

trisomy 6 

trisomy 8 

trisomy 18 

trisomy 21 

hyperdiploidy (>50 chromosomes) Early B precursor ALL,
favourable prognosis 

near haploidy (<30 chromosomes) Common ALL, very poor 
prognosis 

hypodiploidy (30-39 chromosomes) Mainly observed in adult ALL 


*L3 is a classification type of ALL (see Table IX.4). Note the involvement of
immunoglobulin gene loci (e.g. 14q32, IGH) and T-cell receptor gene loci (e.g. 14q11, 
TCRA and TCRD) in the breakpoints of chromosomes. 


Classification   Description

L1   Blasts are mainly small and relatively uniform in
     appearance. The nucleocytoplasmic ratio is high.
     Nuclei are predominantly round, nucleoli
     inconspicuous. Chromatin pattern diffuse,
     smaller blasts show some chromatin
     condensation. Cytoplasm is scanty and slightly to
     moderately basophilic. Some cytoplasmic
     vacuolation may be present.

L2   Blasts are larger than in L1, and more
     heterogeneous. The nucleocytoplasmic ratio is
     lower. Nuclei are more pleomorphic, with some
     nuclei being indented, cleft or irregular.
     Cytoplasm varies in amount and is often
     abundant. Cytoplasmic basophilia is variable.
     Cytoplasmic vacuolation may be present.

L3   Blasts are large and homogeneous. The
     nucleocytoplasmic ratio is high, though not as high as in L1.
     Nuclei are predominantly round with a finely
     stippled chromatin pattern, and prominent
     nucleoli. Cytoplasm is strongly basophilic, and in
     at least some cells there is heavy vacuolation.


Table IX.5 Common chromosome changes in
myelodysplastic syndromes (MDS) excluding chronic
myelomonocytic leukaemia (CMML).

del(5) 

monosomy 7 

del(7) 

trisomy 8 

del(11)(q13q25) (see Fig. 7.9) 

del(13) 

del(20)(q11q13.3) or (q11q13.1) 


del(12)(p11p13) 




*See Table IX.6 for classification. 


Table IX.6 French-American-British classification of myelodysplastic syndromes excluding chronic myelomonocytic leukaemia.

Disease                                   Blasts in marrow (%)   Other features

Refractory anaemia                        <5                     <15% ringed sideroblasts in
                                                                 nucleated red cells

Refractory anaemia with                   <5                     >15% ringed sideroblasts in
sideroblasts                                                     nucleated red cells

Refractory anaemia with                   5-20
excess blasts

Refractory anaemia with excess blasts     21-29                  Also RAEBt if Auer rods
in transformation (RAEBt)                                        present, irrespective of blast
                                                                 count, or if ≥5% blasts in
                                                                 blood


*A blast count of 30% is a diagnostic criterion for acute myeloid leukaemia.
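
The thresholds in Table IX.6 can be read as a simple decision rule. The following is a minimal sketch for illustration only, not a diagnostic tool. The function name `classify_mds` and its parameters are hypothetical. It assumes that the blood-blast threshold is "5% or more", that exactly 15% ringed sideroblasts falls in the lower category, and that marrow blast values between 20 and 21% count as RAEBt, since the table does not settle these boundary cases.

```python
def classify_mds(marrow_blasts, ringed_sideroblasts, blood_blasts=0.0, auer_rods=False):
    """Classify by the Table IX.6 thresholds; all arguments except auer_rods are percentages."""
    # Footnote: a blast count of 30% is a diagnostic criterion for AML.
    if marrow_blasts >= 30:
        return "acute myeloid leukaemia"
    # RAEBt: 21-29% marrow blasts, or Auer rods, or blasts in blood (assumed >= 5%).
    if auer_rods or blood_blasts >= 5 or marrow_blasts > 20:
        return "refractory anaemia with excess blasts in transformation"
    # RAEB: 5-20% marrow blasts.
    if marrow_blasts >= 5:
        return "refractory anaemia with excess blasts"
    # Below 5% blasts the ringed sideroblast count separates the two categories.
    if ringed_sideroblasts > 15:
        return "refractory anaemia with sideroblasts"
    return "refractory anaemia"
```

The checks run from the most to the least advanced category so that Auer rods or blood blasts override a low marrow blast count, as the table states.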


Table IX.7 Common chromosome changes in
myeloproliferative disorders (MPD).


t(9;22)(q34;q11) (CML and CGL) (see Fig. 7.7) 
+1 


-7, +der(1)t(1;7)(p11;p11) 
monosomy 7 

del(7q) 

trisomy 8 

trisomy 9 

del(13q) 

del(20)(q11q13.1) or (q11q13.3) 


*Chronic myeloid leukaemia, chronic granulocytic 
leukaemia, chronic myelomonocytic leukaemia, 
polycythaemia rubra vera, essential thrombocythaemia, 
myelofibrosis. See Table IX.8 for classification.


Table IX.8 Classification of myeloproliferative disorders.

Disease                             Major proliferative component

Chronic myeloid leukaemia           Myeloid activity predominates
Chronic granulocytic leukaemia      Myeloid activity predominates
Chronic myelomonocytic leukaemia    Myeloid activity predominates
Polycythaemia rubra vera            Red cell activity predominates
Essential thrombocythaemia          Platelet activity predominates
Myelofibrosis                       Reactive marrow fibrosis predominates


Table IX.9 Chromosome abnormalities in lymphomas. 


Chromosomal abnormality 


Association in lymphoma 


Chromosome 1 (structural changes, 
i.e. translocations, deletions, duplications, etc.)

1p (structural changes) 

1q (structural changes) 

inv(2;2)(p13;p11.2p11.14) 

t(2;5)(p23;q35) 

trisomy 3 

Chromosome 3 (structural changes) 

t(3;14)(q27;q32) 

t(3;4)(q27;p11) 

del(3p) 

t(4;16)(q26;p13) 

6p (structural changes) 

del(6q) 

i(6p) 

trisomy 7 

7p15p21 (structural changes) 

7q35q36 

trisomy 8 


t(8;14)(q24;q32), t(2;8)(p12;q24), t(8;22) 
(q24;q11) 

t(10;14)(q24;q32) 

del(11)(q23q25) 


t(11;14)(q13;q32) 
trisomy 12 
t(14;18)(q32;q21) 


14q+ 


inv(14)(q11;q32) 

17q21q25 (structural changes) 
trisomy 18 

del(22)(q11) 


25% of non-Hodgkin’s lymphomas 
(NHL), often as secondary change 

T-cell lymphoma 

Diffuse large cell lymphoma 


Malignant histiocytosis 

T-cell/ Diffuse mixed large and small cell lymphoma 
25% of NHL/diffuse large cell lymphoma 

Diffuse large cell lymphoma 


Immunoblastic lymphoma 

T-cell lymphoma 

T-cell lymphoma 

15% NHL, often as secondary change 
Follicular small cleaved cell type lymphoma 
5-15% NHL/ follicular large cell lymphoma 


T-cell lymphoma 

Follicular, mixed small cleaved cell and large cell 
lymphomas 

Most common translocations seen in 
Burkitt's lymphoma 

B-cell lymphoma 

B-cell immunophenotype. Diffuse mixed small 
and large cell lymphoma 

Small cell lymphocytic lymphoma and B-cell chronic 
lymphocytic leukaemia (CLL) 

Small cell lymphocytic lymphoma and B-cell CLL 

Most common change in NHL. B-cell lymphomas of 
follicular morphology, frequent in small cleaved 
cell type 

Most common change in NHL (includes the t(14;18) 
and t(11;14) as described above). Occurs in 50% 
cases 

T/B-cell lymphoma 

Follicular large cell lymphoma 
10-15% NHL, usually as a secondary change 

Occurring as a Philadelphia translocation, 
variable histological type 


Note the involvement of immunoglobulin gene loci (e.g. 14q32, IGH) and T-cell receptor gene loci (e.g. 14q11, TCRA and TCRD) in the breakpoints.


Chromosomal abnormality Association

B lineage 

Rearrangements of chromosome 1, 14q+ Multiple myeloma 

Rearrangements of chromosome 1, 14q+ Plasma cell leukaemia 

+12, 14q+, del(13q) Chronic lymphocytic 
leukaemia 

14q+, t/del(12)(p12p13) Prolymphocytic leukaemia 

14q+, del(14q) Hairy cell leukaemia 

? Waldenström’s
macroglobulinaemia

T lineage 


Rearrangements of chromosome 1, t/del(6p) 


inv(14)(q11q32), t/del(14)(q11) 
14q+, 14q11, del(6q)


14q11 (structural changes)


Cutaneous T-cell lymphoma, 
Sézary’s syndrome, mycosis 
fungoides 

Large granular lymphocytic 
leukaemia 

Adult T-cell 
leukaemia/lymphoma 

Prolymphocytic leukaemia 


Table IX.10 Chromosome changes in chronic lymphoproliferative disorders.


Table IX.11 Consistent chromosome rearrangements in solid tumours.


Tumour type Chromosomal rearrangement Genes 
Alveolar rhabdomyosarcoma t(2;13)(q35;q14) PAX3, FKHR 
t(1;13)(p36;q14) PAX7,FKHR 
Breast adenocarcinoma del(3)(p14p23) 
del(1)(p13p36) 
del(16)(q21q24) 
del(6)(q21q27) 
dmin, hsr 
Colorectal adenoma +8, +13, +14 
del(1)(p32-36) 
Colorectal adenocarcinoma +13, -14, -18, +X 
del(17)(p11p13) 
del(8)(p11p23) 
del(5)(q22q35) 
del(10)(q22q26) 
Clear cell sarcoma t(12;22)(q13;q12) ATF1, EWS 
Ewing’s sarcoma and t(11;22)(q24;q12) FLI1, EWS 
peripheral primitive t(21;22)(q22;q12) ERG, EWS 
neuroectodermal tumours t(7;22)(p22;q12) ETV1, EWS 
(pPNET), Askin tumour 
Ewing’s sarcoma, rhabdomyosarcoma, der(1)t(1;16)(q10-25;q10-24)
Wilms’ tumour
Extraskeletal myxoid chondrosarcoma t(9;22)(q22;q12) CHN/TEC, EWS 
Follicular thyroid adenoma +5, +12 
t(2;3)(q12q13;p14p15) 
Glioma dmin 
Haemangiopericytoma t(12;19)(q13;q13.3) 
Intra-abdominal small cell sarcoma t(11;22)(p13;q12) WT1, EWS 


Lipoma 


Malignant fibrous histiocytoma 


t(3;12)(q27q28;q13q14) 
t/ins(1;12)(p32p34;q13q15) 
t/ins(12;21)(q13q15;q21q22) 
t(2;12)(p21p23;q13q14) 
del(13)(q12;q22) 

Ring chromosomes 
add(19)(p13) changes in 
chromosomes 1, 3p, 11p 


Continued.

Table IX.11 Continued.

Tumour type Chromosomal rearrangement Genes


Malignant melanoma 


Meningioma 
Myxoid liposarcoma 
Neuroblastoma 


Nonpapillary renal cell carcinoma 


Non-small cell undifferentiated 
lung carcinoma 


Ovarian adenocarcinoma 


Papillary renal cell carcinoma 
Papillary thyroid carcinoma 


Primitive neuroectodermal 
tumours of central nervous system 
Retinoblastoma 


Salivary gland adenoma 
Small cell undifferentiated lung 
adenocarcinoma 


Synovial sarcoma 
Transitional cell bladder 
carcinoma 


Testicular teratoma/seminoma 
Uterine leiomyoma 


Wilms’ tumour


t/del(1)(p12p22) 
t(1;19)(q12;p13) 
t/del(6q)/i(6p) 
+7 

-22, del(22q) 
t(12;16)(q13;p11) CHOP, TLS/FUS 
del(1)(p31p36) 
hsr, dmin 
—14,-17 
del(3)(p11p22) 
del(5)(q22q35) 
t(3;5)(p13;q22) 
del(3)(p14p23) 
del(15)(p10p11) 
del(9)(p21p23) 
del(17)(p11p15) 
del(11)(p11p15) 
del(1)(p32p36) 
del(7)(p11p13) 
hsr, dmin 
—13,-17,-18, —X 
del(6)(q15q25) 
del(11)(q11q15) 
del(1)(q21q44) 
del(1)(p31p36) 
del(3)(p13p23) 
del(9)(p22p24) 
+17 
t(X;1)(p11;q21) 
inv(10)(q11.2q21) RET, unknown 
t(10;17)(q11.2;q23) 
i(17p) 


Structural changes of 1 

i(6p) 

del(13)(q14)/-13 RB1 
t(3;8)(p21p23;q12) 

=3 

del(3)(p14p24) 
del(1)(q32q44) 
del(17)(p11p13) 
del(5)(q13q33) 

hsr, dmin 
t(X;18)(p11.2;q11.2)
~9/del(9)(q1q34) 
del(11)(p11p15) 
del(6)(q21q25) 
del(3)(p14p21) 
del(10)(q24q26) 
i(5)(p10) 

i(12p) 
del(7)(q11.2q22;q31q32) 
t(12;14)(q14q15;q23q24) 
—22 

=p 

+12, +18 
del(11)(p13p15) WT1 


SSX1/SSX2, SYT 


Table IX.12 Gene amplifications associated with solid tumours.


Gene Location of normal allele Malignancy 

AR Xq11q13 Prostate carcinoma

C-MYC 8q24 Breast, colorectal, lung carcinoma and 
many other solid tumours 

CCND1 11q13 Breast, oesophageal carcinoma, 


squamous carcinoma, and many 
other solid tumours 


HST-1 

GST 

SEA 

EGFR 7p11p13 Squamous cell carcinoma, astrocytoma

ERBB-2 17q12 Breast, ovarian, gastric carcinoma, and
many other solid tumours

GLI 12q13 Soft tissue sarcomas, glioma 

SAS 

CDK-4 

MDM2 

HRAS 11p15 Bladder carcinoma 

IGFR-1 15q25q26 Breast carcinoma 

MYCL 1p32 Small cell lung carcinoma 

MYB 6q22q23 Colorectal carcinoma 

MYCN 2p24 Neuroblastoma, retinoblastoma, small 
cell lung carcinoma, alveolar 
rhabdomyosarcoma 

PDGFRA 4q12 Glioblastoma 

PDGFRB 5q33q35 Glioblastoma 


Abbreviations, acronyms and glossary


AtDB the Arabidopsis thaliana Database 

ABHG American Board of Human Genetics 

ABRC Arabidopsis Biological Research Centre 

ACeDB the Caenorhabditis elegans Database 

ACMG American College of Medical Genetics 

ADA adenosine deaminase 

ADP adenosine-5’-diphosphate 

AFLP amplified fragment length polymorphism 

AFM Association Francaise contre les Myopathies 
(French Muscular Dystrophy Association) 

8-AG 8-azaguanine 

AIMS Arabidopsis Information Management System 

ALL acute lymphocytic leukaemia 

Alu-PCR type of IRS-PCR using primers corresponding to 
the Alu repeat sequence in the human genome 

AMCA 7-amino-4-methylcoumarin-3-acetic acid 

AML acute myeloid leukaemia 

AMP adenosine 5’ monophosphate 

AMPFLP amplified fragment length polymorphism 

AMV avian myeloblastosis virus 

anchored island in genomic mapping, a group of one or 
more clones linked together by anchors (localized 
sequences) they share 

APC familial adenomatous polyposis coli 

APH aminoglycoside phosphotransferase 

APML acute promyelocytic leukaemia 

approximate map genetic map in which the position of 
markers is shown as the range of intervals which a 
particular marker could occupy at framework support. 

APRT adenine phosphoribosyltransferase 

APS ammonium persulphate

ARC Association pour la Recherche sur le Cancer

ARMS amplification refractory mutation system 

ARS autonomously replicating sequence (yeast) 

ASHG American Society of Human Genetics 

ASO allele-specific oligonucleotide 

ATCC American Type Culture Collection 

ATP adenosine triphosphate 

BA 6-benzyladenine 

BAC bacterial artificial chromosome 

back-cross the cross from a hybrid to one of its parental
strains/species

bacterial artificial chromosome (BAC) vector based on a
bacterial F factor

BCIP 5-bromo-4-chloro-3-indolyl phosphate (X-phos) 

b.p. boiling point 

bp base pair 

Bq becquerel 

BrdU 5-bromodeoxyuridine 

BSA bovine serum albumin 

C-banding chromosome-banding technique that stains 
the constitutive heterochromatin 

cAMP cyclic adenosine-3’,5’-monophosphate

CAPS codominant cleaved amplified polymorphic 
sequences 

CAT chloramphenicol acetyl transferase 

cccDNA covalently closed-circular DNA 

CCD charge-coupled device 

CD cytosine deaminase 

CDA deoxycytidine deaminase 

CDC Centers for Disease Control 

CDGE constant denaturing gel electrophoresis

cDNA complementary DNA

centiMorgan unit of recombination frequency. One 
centiMorgan (1 cM) is equivalent to 1% recombinant 
offspring or a recombination fraction of 0.01. 
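
The equivalence stated above can be written as a one-line conversion. The following is a minimal sketch; the function names are hypothetical. The direct conversion follows the glossary definition. The second function uses Haldane's map function, which is not part of the glossary and is included only as an assumption to show how the two measures diverge once multiple crossovers become likely.

```python
import math


def rf_to_cm_direct(rf):
    """Glossary definition: 1% recombinant offspring (rf = 0.01) is 1 cM."""
    return 100.0 * rf


def rf_to_cm_haldane(rf):
    """Haldane's map function (an assumption here): d = -50 * ln(1 - 2 * rf) cM."""
    if not 0.0 <= rf < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * rf)
```

At a recombination fraction of 0.01 the two conversions agree to within about 1%, so the simple equivalence is adequate for closely linked markers.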

CEPH Centre d’Etude du Polymorphisme Humain 
(Centre for the Study of Human Polymorphism) 

CF cystic fibrosis 

CGH comparative genomic hybridization 

CGL chronic granulocytic leukaemia 

CHIAS chromosome image analysing system 

CHLC Cooperative Human Linkage Center 

chromosome microdissection technique in which a
specific region of a chromosome is manually isolated 
from a cell using specially designed microneedles 

chromosome painting visualization of a whole 
chromosome by hybridization with chromosome- 
specific fluorescent probes 

Ci Curie 

CIAP/CIP calf intestine alkaline phosphatase 

CISSH chromosomal in situ suppression hybridization 

CLL chronic lymphocytic leukaemia 

cM centiMorgan 

CML chronic myeloid leukaemia

CNRS Centre National de la Recherche Scientifique 
(National Centre for Scientific Research) 

coefficient of coincidence ratio of the observed number 
of recombinants to the expected number of 
recombinants. 
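
As a worked illustration of this ratio, the following minimal sketch assumes the usual three-point cross setting, in which the recombinants counted are double recombinants and the expected number is the product of the two single-interval recombination fractions and the number of offspring. The function and parameter names are hypothetical.

```python
def coefficient_of_coincidence(observed_doubles, rf_first, rf_second, total_offspring):
    """Observed double recombinants divided by the number expected without interference."""
    expected_doubles = rf_first * rf_second * total_offspring
    if expected_doubles <= 0:
        raise ValueError("expected number of double recombinants must be positive")
    return observed_doubles / expected_doubles


def interference(coincidence):
    """Interference is conventionally 1 minus the coefficient of coincidence."""
    return 1.0 - coincidence
```

With recombination fractions of 0.1 and 0.2 among 1000 offspring, 20 double recombinants are expected; observing 6 gives a coefficient of 0.3.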

comparative genomic hybridization (CGH) technique 
for identifying gains (including genomic amplification) 
and losses of chromosomal material by the 
cohybridization of differentially labelled tumour and 
normal DNA to normal chromosomes 

complex traits/diseases traits or diseases that are not 
inherited as simple Mendelian traits 

comprehensive map genetic map in which markers are 
included in their most likely positions irrespective of 
the statistical support 

ConA concanavalin A 

contig a contiguous set of clones spanning a region 
without gaps 

CORN Council of Regional Networks for Genetic Services 

cosmid plasmid vectors of approximately phage λ size,
which are introduced into Escherichia coli by in vitro
packaging and infection as defective λ phage and
circularize in vivo

cpm counts per minute 

cR centiRays 

CSSH chromosomal in situ suppression hybridization 

Da dalton 

DABCO 1,4-diazabicyclo[2.2.2]octane

DAPI diamidino-2-phenylindole-dihydrochloride 

DBE direct blotting electrophoresis

dbEST database of expressed sequence tags 

DCK deoxycytidine kinase 

DDBJ DNA Data Bank of Japan 

ddNTP dideoxyribonucleoside triphosphates 

DEAE diethyl aminoethyl 

degenerate oligonucleotide primed PCR (DOP-PCR)
method for random amplification by PCR of short DNA
fragments at frequently occurring priming sites within
the genome using a primer that contains partially
degenerate sequence

denaturing gradient gel electrophoresis technique for 
detecting mutations 

DGGE denaturing gradient gel electrophoresis 

DHFR dihydrofolate reductase 

DKFZ Deutsches Krebsforschungszentrum (Heidelberg,
Germany)

DMD Duchenne muscular dystrophy 

DMF dimethylformamide 

DMSO dimethylsulphoxide 

DNA deoxyribonucleic acid 

DNA fingerprinting strictly, a method of identifying an 
individual by DNA analysis by typing a large number 
of hypervariable loci to obtain an individual-specific 
pattern. 

DNase deoxyribonuclease 

DNP 2,4-dinitrophenyl 

dNTP deoxyribonucleoside triphosphate

DOE Department of Energy (USA) 

DOP-PCR degenerate oligonucleotide primed PCR 

dpm disintegrations per minute 

dsDNA double-stranded DNA 

dsRNA double-stranded RNA 

DSS disuccinimidyl suberate 

DTE direct transfer electrophoresis 

DTT dithiothreitol 

EBI European Bioinformatics Institute 

EBV Epstein-Barr virus 

EC European Commission 

EDTA ethylenediaminetetraacetic acid 

EEC European Economic Community 

ELISA enzyme-linked immunosorbent assay 

EMBL European Molecular Biology Laboratory 
(Heidelberg, Germany) 

EMBO European Molecular Biology Organization 

EMS ethyl methanesulphonate 

ESSA European Scientists Sequencing Arabidopsis
(project)

EST expressed sequence tag 

EtBr ethidium bromide 

EU European Union 

EUCIB European Collaborative Interspecific Backcross 
(program) (mouse) 

EUROFAN European Functional Analysis Network 

EUROGEM European Gene Mapping Project (EC 
sponsored project) 

eV electron volt 

exon trapping cloning technique for isolating coding 
sequences using specialized vectors in which only 
DNA containing exons can be maintained 

F farad 

F1, F2, etc. first filial generation; second filial generation 
etc. 

FACS fluorescence-activated flow sorting (of 
chromosomes) 

FAP familial adenomatous polyposis coli 

FCS fetal calf serum 

FdU fluorodeoxyuridine

FIGE field inversion gel electrophoresis

FISH fluorescence in situ hybridization 

FITC fluorescein isothiocyanate 

fluorochrome dye that fluoresces in a particular colour 
under UV 

forward chromosome painting technique using 
chromosome paints prepared from normal 
chromosomes and applied to metaphases containing 
abnormal chromosomes to reveal the identity of the 
aberrant chromosomes 

fp flash point 

framework map genetic map in which the placement of 
individual loci has a statistical support of at least 
1000: 1 

FTP, ftp file transfer protocol 

G-banding Giemsa banding of chromosomes 

GDB Genome Data Base (Baltimore, USA) 

gDGGE genomic DGGE 

genetic map a map showing the order and distance 
between polymorphic chromosomal markers, 
constructed by linkage analysis 

genomic mismatch scanning technique for identifying 
regions of identity between DNA samples by solution 
hybridization and detection of heteroduplexes 

GISH genomic in situ hybridization, chromosome 
painting 

GM-CSF granulocyte-macrophage colony-stimulating
factor

gridding arraying clones from multiple microtitre plates 
at high density onto membranes. These can be used for 
screening all clones in parallel by hybridization. 

h hour 

heterozygosity (of a locus) the frequency in the 
population of heterozygotes at that locus, which is 
usually expressed as a percentage or a frequency value 
between 0 and 1.0 
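
The definition above can be turned into a short calculation. The following is a minimal sketch with hypothetical function names. The first function counts heterozygotes directly, as in the definition. The second gives the value expected from allele frequencies under random mating, which is an added assumption and not part of the glossary entry.

```python
def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles at the locus differ."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for first, second in genotypes if first != second)
    return heterozygotes / len(genotypes)


def expected_heterozygosity(allele_frequencies):
    """Expected value under random mating (an assumption): 1 minus the sum of p squared."""
    return 1.0 - sum(p * p for p in allele_frequencies)
```

Both functions return a frequency between 0 and 1.0; multiply by 100 for a percentage.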

HGMP Human Genome Mapping Project (UK) 

HGMW Human Gene Mapping Workshops 

HGPRT/HPRT hypoxanthine guanine 
phosphoribosyltransferase 

HHMI Howard Hughes Medical Institute (USA) 

HNPCC hereditary nonpolyposis colon cancer 

homoeologues genetically and evolutionarily related 
chromosomes from different genomes within a 
heterogenomic polyploid or from related species; they 
are capable of pairing among themselves. 

HPLC high-performance liquid chromatography 

HUGO Human Genome Organization 

hypervariable locus locus with an exceptionally high 
degree of polymorphism 

i.b.d identical-by-descent 

ICRF Imperial Cancer Research Fund (UK) 

IFGT irradiation and fusion gene transfer 

IGD Integrated Genome Database (Heidelberg, 
Germany) 

in situ hybridization banding (ISHB) R-banding by 
hybridizing with labelled Alu sequences 

inclusive map see comprehensive map 

incomplete penetrance case where some carriers of a
dominant gene (or homozygotes for a recessive gene)
do not express the phenotype; e.g. for a disease gene
with incomplete penetrance, some carriers do not
express any symptoms of the disease at all

integrated genome map 

interphase FISH analysis FISH-based technique for 
identifying specific rearrangements in nondividing 
cells using region-specific markers. 

IPTG isopropyl-β-D-thiogalactoside

irradiation and fusion gene transfer (IFGT) fusion of an 
irradiated donor cell with a nonirradiated recipient cell 
line; the resultant hybrids contain many fragments of 
donor chromosomes 

IRS interspersed-repeat sequence 

IRS-PCR interspersed repetitive sequence PCR 

ISCN International System of Cytogenetic Nomenclature 

ISHB in situ hybridization banding 

ITC isothiocyanate 

kb/kbp kilobases/kilobase pairs (10³ bases/base pairs)

kd, kDa kilodalton 

KOAc potassium acetate 

LA-PCR linker-adaptor PCR 

LB (1) loading buffer; (2) Luria broth 

LINE long interspersed repeat element 

linkage map see genetic map 

linker-adaptor PCR technique for production of complex 
chromosome-specific libraries from DNA ligated at 
each end to an adaptor oligonucleotide and amplified 
by PCR using a primer for the adaptor sequence. 

LMP low melting point 

lod score (z) method of determining whether two loci are
linked. The log10 of the ratio of the odds on linkage
between two loci at a given recombination fraction to
the odds on no linkage (recombination fraction 0.5).
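
For the simplest case the lod score can be computed directly. The following is a minimal sketch, assuming phase-known meioses that can each be scored as recombinant or nonrecombinant, so that the likelihood at recombination fraction theta is a binomial term compared with the same term at theta = 0.5 (no linkage). The function name and parameters are hypothetical; real pedigrees need dedicated linkage software.

```python
import math


def lod_score(recombinants, nonrecombinants, theta):
    """log10 of the likelihood at theta divided by the likelihood at theta = 0.5."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0 and recombinants > 0:
        # Any observed recombinant is impossible at theta = 0.
        return float("-inf")
    log_linked = 0.0
    if recombinants:
        log_linked += recombinants * math.log10(theta)
    if nonrecombinants:
        log_linked += nonrecombinants * math.log10(1.0 - theta)
    log_unlinked = (recombinants + nonrecombinants) * math.log10(0.5)
    return log_linked - log_unlinked
```

A lod score of 3 corresponds to odds of 1000:1, the level of support quoted for a framework map. In this model, ten informative meioses with no recombinants give a score just above 3 at theta = 0.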

LOH loss of heterozygosity 

LSB low salt buffer 

LTR long-terminal repeat 

mA milliampere 

MALDI matrix-assisted laser desorption/ionization

Mb/Mbp megabases/megabase pairs (10⁶ bases/base
pairs)

MC multiple copy 

MDS myelodysplastic syndromes 

mg milligram 

MGD Mouse Genome Database 

microFISH FISH using a probe produced by chromosome 
microdissection and amplification by PCR 

microcell hybrids somatic cell hybrids derived from the 
fusion of micronuclei (subnuclear packets containing a 
subset of the donor genomic chromosomes) with intact 
recipient cells 

microcell-mediated chromosome transfer (MMCT) 
production of a hybrid somatic cell by fusion of a cell 
from one species with a microcell derived from another 

microsatellite an array of tandemly repeated very short 
sequences of nucleotides 

min minute 

minisatellite an array of tandem repeats typically in the 
range 1-30 kb and composed of 8- to 100-bp repeats 

MMCT microcell-mediated chromosome transfer 

mol mole 

mol. wt molecular weight

m.p. melting point

MPD myeloproliferative diseases 

MRC Medical Research Council (UK) 

mRNA messenger RNA 

MTX methotrexate 

multiplex sequencing a variant of the shotgun 
sequencing strategy in which a number of samples are 
pooled during processing and separated by 
hybridization detection at the end of the process. 

MVR-PCR minisatellite variant repeat PCR 

NASC Nottingham Arabidopsis Stock Centre

NBT N-hydroxylsuccinimidyl

NCHGR National Center for Human Genome Research 
(USA) 

NCI National Cancer Institute (USA) 

NIH National Institutes of Health (USA) 

NOR nucleolar organizer region

NPT II neomycin phosphotransferase II

NRC National Research Council (USA) 

NSF National Science Foundation (USA) 

NTA nitrilotriacetic acid 

NTP nucleoside triphosphate 

ODC ornithine decarboxylase 

OLA oligonucleotide ligation assay 

OMIM Online Mendelian Inheritance in Man 

ORF open reading frame 

P1 artificial chromosome (PAC) cloning system based on 
the P1 cloning vector but using a circular recombinant 
DNA 

PAC P1 artificial chromosome 

PAGE polyacrylamide gel electrophoresis 

PALA N-(phosphonacetyl)-L-aspartate 

PASA PCR amplification of specific alleles 

PBS phosphate-buffered saline 

PBSA phosphate-buffered saline solution A 

PCR polymerase chain reaction 

PEG polyethylene glycol 

penetrance the probability that an individual is affected 
given their genotype at the disease-causing locus. 

PFGE pulsed field gel electrophoresis 

PGD Plant Genome Database 

pH hydrogen ion exponent

PHA phytohaemagglutinin 

phagemid plasmid containing the origin of replication 
from a filamentous phage, which can be packaged into 
the phage capsid. 

physical map an ordered sequence of overlapping 
cloned DNAs that span a genomic region 

pI isoelectric point

PI propidium iodide 

PIC polymorphism information content 

picking inoculating single colonies into wells of 
microtitre plates, and growing each as individual 
cultures. If these cultures are frozen in suitable freezing 
medium, they serve as a permanent source of these 
clones from which unlimited copies can be made. 

PMA phorbol myristate acetate 

PMSF phenylmethylsulphonylfluoride 

polymorphism genetic variation at a locus at which at
least one in 50 unselected individuals has a variant
allele; that is, the variant allele has a frequency greater
than 0.01.

polymorphism information content (PIC) a measure of 
informativeness of a marker in linkage studies which 
takes into account the fact that half the progeny of 
matings of the type A1A2 × A1A2 will also be
heterozygous and therefore uninformative for linkage. 
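
The glossary does not give a formula for PIC. The following minimal sketch uses the commonly quoted expression, PIC = 1 - Σ p_i² - Σ 2 p_i² p_j² (summed over allele pairs i < j), which is an assumption here and not taken from this text. The function name `pic` is hypothetical. The last term is the correction described above: matings between identical heterozygotes occur with probability 4 p_i² p_j², and half of their progeny are uninformative.

```python
from itertools import combinations


def pic(allele_frequencies):
    """Polymorphism information content from allele frequencies that sum to 1."""
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Probability that a parent is homozygous and therefore uninformative.
    homozygous = sum(p * p for p in allele_frequencies)
    # Half the progeny of AiAj x AiAj matings: 0.5 * (2*pi*pj)**2 = 2*pi**2*pj**2.
    uninformative = sum(2.0 * p * p * q * q for p, q in combinations(allele_frequencies, 2))
    return 1.0 - homozygous - uninformative
```

For two equally frequent alleles this gives 0.375, lower than the heterozygosity of 0.5 at the same locus.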

probe labelled nucleic acid of known sequence used to 
detect and identify complementary sequences by 
hybridization 

PTS probe-tagged site 

PWM pokeweed mitogen 

Q-banding quinacrine banding of chromosomes 

QTL quantitative trait loci 

R-banding reverse banding of chromosomes 

RACE rapid amplification of cDNA ends

radiation hybrid (RH) somatic hybrid cell produced by 
fusion of a cell from one species with a radiation-treated 
cell from another species, in which the chromosomes 
have been fragmented. 

radiation hybrid mapping see RH mapping

radiation mapping mapping technique involving cell 
hybrids that is based on the fact that the probability that 
two loci will be separated by a radiation-induced break 
and be carried on different chromosomal fragments 
should be proportional to their distance apart 

RAPD random amplified polymorphic DNA 

RAPD-PCR method of detecting polymorphisms by 
scanning the genome by PCR for a number of 
arbitrarily primed polymorphic loci 

RDA representational difference analysis

RE restriction endonuclease 

recombination fraction (RF) the proportion of the total 
number of offspring that do not have a parental
combination of alleles; that is, those that have a
recombined pattern. RF = (number of recombinant 
offspring) / (number of recombinant offspring + number 
of nonrecombinant offspring) 
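
The formula above translates directly into code. The following is a minimal sketch with a hypothetical function name; the counts are assumed to come from offspring that can each be scored unambiguously as recombinant or nonrecombinant.

```python
def recombination_fraction(recombinant, nonrecombinant):
    """RF = recombinant offspring / (recombinant + nonrecombinant offspring)."""
    total = recombinant + nonrecombinant
    if total == 0:
        raise ValueError("at least one offspring must be scored")
    return recombinant / total
```

For example, 10 recombinant and 90 nonrecombinant offspring give RF = 0.1, that is, 10% recombinant offspring.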

representational difference analysis (RDA) technique 
for identifying restriction fragments present in one 
sample but missing from another 

RF recombination fraction 

RFH radiation fusion hybrid 

RFLP restriction fragment length polymorphism 

RFLV restriction fragment length variant 

RGP Japanese Rice Genome Research Program 

RH radiation hybrid 

RH mapping radiation hybrid mapping, a somatic cell 
genetic technique in which DNA markers can be 
mapped relative to one another using IFGT. It is based 
on the fact that the likelihood of two markers being 
separated by a radiation-induced break in the DNA is a 
function of physical distance, so that markers closer 
together have a higher probability of coretention in any 
given hybrid than markers further apart. 
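
The relationship between coretention and distance can be illustrated for two markers. The following is a minimal sketch, assuming a commonly used two-point estimator in which the breakage frequency theta is the number of hybrids retaining only one of the markers divided by T(RA + RB - 2 RA RB), where T is the number of hybrids and RA, RB are the retention frequencies. Both this estimator and the conversion to centiRays, -100 ln(1 - theta), are assumptions not stated in the glossary, and the function names are hypothetical.

```python
import math


def breakage_frequency(hybrids):
    """Estimate theta from (marker_a_retained, marker_b_retained) pairs, one per hybrid."""
    total = len(hybrids)
    if total == 0:
        raise ValueError("at least one hybrid is required")
    retention_a = sum(1 for a, _ in hybrids if a) / total
    retention_b = sum(1 for _, b in hybrids if b) / total
    # Hybrids in which a break is inferred: exactly one marker retained.
    discordant = sum(1 for a, b in hybrids if a != b)
    denominator = total * (retention_a + retention_b - 2.0 * retention_a * retention_b)
    if denominator == 0:
        raise ValueError("markers retained in all or none of the hybrids are uninformative")
    return discordant / denominator


def centirays(theta):
    """Convert a breakage frequency to map distance in cR (assumed mapping function)."""
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    return -100.0 * math.log(1.0 - theta)
```

Markers that are usually retained together give a small theta and therefore a short distance, which is the behaviour the definition describes.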

RIL recombinant inbred line 

RNA ribonucleic acid 

RNase ribonuclease 

rpm revolutions per minute 

rRNA ribosomal RNA

RT reverse transcription/transcriptase

RT-PCR reverse transcription/PCR amplification 
procedure 

s second 

SBH sequencing by hybridization 

SC single copy 

SCAR sequence characterized amplified regions 

SCE buffer sorbitol/sodium citrate/EDTA

SDS sodium dodecyl sulphate 

shotgun sequencing strategy for DNA sequencing in 
which the DNA is fragmented at random, the 
fragments sequenced, and the sequence then assembled 
in the correct order. 

SIGMA System for Integrated Genome Map Assembly 

SINE short interspersed repeat elements 

SSB single-stranded binding (protein) 

SSC sodium chloride/sodium citrate 

SSCP single-stranded conformation polymorphism 

SSCP analysis technique for detecting mutations 

SSCT sodium chloride/sodium citrate/Tween or Triton- 
X-100 

SSCTM sodium chloride/sodium citrate/Triton-
X-100/dried milk

ssDNA single-stranded DNA 

SSLP simple sequence length polymorphism 

ssRNA single-stranded RNA 

STA Science and Technology Agency (Japan) 

STC buffer sorbitol/Tris-HCl/CaCl2

STM scanning tunnelling microscopy 

STR simple tandem repeat, i.e. di-, tri- and tetranucleotide
repeat loci

STS sequence-tagged site 

synteny the situation where a set of genes is in the same 
order on the chromosome in different organisms 

TAE buffer Tris/acetic acid/EDTA 

TBE buffer Tris/boric acid/EDTA 

TdT terminal deoxynucleotidyltransferase 

TE buffer Tris/EDTA 

TEMED N,N,N’,N’-tetramethyl-1,2-diaminoethane

6-TG 6-thioguanine 

TGGE temperature gradient gel electrophoresis 

TK thymidine kinase 

Tm melting temperature

TPA the mitogen 12-O-tetradecanoylphorbol-13-acetate 

TRAP tumour necrosis factor-related activation protein 

TRITC tetramethylrhodamine isothiocyanate

tRNA transfer RNA 

UAS upstream activating sequence 

URF unidentified reading frame 

UV ultraviolet 

VNTR variable number tandem repeat 

v/v volume/volume 

w/v weight/volume 

w/w weight/weight 

WWW World Wide Web 

X-GAL 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside

X-Phos see BCIP 

YAC yeast artificial chromosome 

YGSC Yeast Genetic Stock Center 

Z lod score


Page numbers in italic type refer to figures
and tables; those in bold refer to protocols


abortion, recurrent 156 
abscisic acid sensitivity control 773 
ACEDB database 691, 692, 693, 812, 834, 
835 
(AC)n repeats 108-9
acute leukaemia symptoms 164 
acute lymphoblastic leukaemia 162 
acute lymphocytic leukaemia 160 
acute lymphoid leukaemia (ALL) 
chromosome changes 978-9 
classification 979 
acute myeloid leukaemia (AML) 160, 162, 
163, 164 
chromosome changes 977-8 
chromosome painting 248,249 
classification 978 
acute non-lymphocytic leukaemia (ANLL) 
162 
acute promyelocytic leukaemia (APML) 
163 
ADA gene 653-4 
addresses 879-80 
adenine phosphoribosyltransferase 333 
adenine plus alanosine medium 333 
adenomatous polyposis coli (APC) 82 
adenosine deaminase (ADA) 653 
deficiency and gene therapy 662, 663 
adenoviral vectors 662 
adherent cell—adherent cell fusions 327 
Aegilops speltoides 747 
AFLP scanning 113 
AGAMOUS gene 779
age, breast cancer liability classes 32
AIDS, gene therapy 652, 660-1 
albizzin/asparagine synthetase 337 
alkaline phosphatase 
chemiluminescent substrates 604 
fluorigenic substrates 606 
sequence labelling 603 
alkaline phosphatase-labelled antibodies 
603 
allele 
drop-out 134 
frequencies 130 
marker 39-40 
allele-specific oligonucleotide (ASO) 
106 
allelic heterogeneity 33 
allopolyploid 746 
alpha satellite probes 217 
Alu consensus 620, 621 
Alu elements 109, 111
human 110 
Alu paints 299 
Alu repeat 244-5 
exon amplification 446 
Alu sequences, repeat analysis 619, 620, 
621 
Alu subfamily identification 620, 622 
Alu-PCR 108, 244-5 
banding 245 
chromosome paints 156, 249,296 
generation 298-9 
flow sorting 294 
flow-sorted chromosomes 
abnormal Plate 5, 246 
amplification 252-3
with primer AGK34 227-8
somatic cell hybrid characterization 345 
whole chromosome painting probes 
Plate 4 
YACs 218 
amplified sequences 249 
alveolar rhabdomyosarcoma 194,195 
genomic hybridization analysis Plate 1 
interphase FISH analysis Plate 2 
Alzheimer’s disease 656 
7-amino-4-methylcoumarin-3-acetic acid 
(AMCA) 215 
ammonium persulphate (APS) 581 
amniocentesis 153 
amniotic fluid 
cultures 173-4 
harvesting 153-4, 177-9 
prenatal diagnosis 153 
unsynchronized cultures 277 
amphidiploids, plant genome analysis 
750-1 
amphiplasty 748 
ampicillin resistance 375 
amplification refractory mutation system 
(ARMS) 107 
amplified length polymorphisms 
(AMPFLP) 134 
data interpretation 135 
anchored island 433, 434 
length 436 
number 436 
proportion of genome not covered by 
436 
singleton 434, 435 
annealing, probe ordering 427, 429
anonymous ftp 815, 816, 820-33
Anopheles gambiae 683 
anti-BrdU antibodies 216 
anti-BrdU-FITC 235 
Antirrhinum, myb-related genes 780 
antisense constructs 657 
APC gene 500,501 
mutations 501,503 
applets 835
apoptosis 656-7
APRT gene 338 
Arabidopsis Plate 10, 632,762 
AGAMOUS gene 779 
cDNA sequences 811 
chloroplast genome 763 
chromosome number 887 
Cnx1 protein 780-1
cross hybridization of genes 779 
ESSA project 770,778 
generation time 762 
genetic maps 764-5
amplified fragment length 
polymorphisms (AFLPs) 765 
expressed sequence tags (ESTs) 765 
molecular markers 765 
mutants 764 
recombinant inbred lines 765 
genome 
mapping project 438 
size 762-4, 888 
Information Management System 
(AIMS) 775,778 
mitochondrial genome 763 
model for non-plant species 780-1 
myb-related genes 780 
physical map 765-6 
RFLP markers 766 
seed size 762 
software 775 
communication tools 777 
stock centres 774-5, 777 
useful addresses 777-8 
YAC clone markers 766 
YAC libraries 765-6 
ARABIDOPSIS newsgroup 775-6, 777 
Arabidopsis nuclear genome 762,763 
AtDB database 775,776 
Athila retrotransposon 763 
cDNA sequencing 386-90 
classical gene identification 766 
DNA renaturation kinetics 763 
enhancer trap 769
expressed sequence tags (ESTs) 769-73
gene 
density 773 
duplication 774 
identification strategies 766-9 
intron number 773 
genetic complexity 765 
genomic sequencing 773-4 
map-based cloning 766-9
microsatellites 763 
model for other species 779-81 
multigene families 772,773 
mutant complementation 766-7 
organization 763
plasma membrane integral proteins 766
promoter trapping 769 
regulatory protein 773 
repeat sequences 763 
sequencing of large portions 774 
size 763 
T-DNA 767-8, 769 
Ta1 element 763
Tat1 763
transposon tagging 768-9 
tumour suppressor gene homologues 
772 
asnA gene 337 
asparagine synthetase 337 
asynapsis 751 
ataxia telangiectasia 475 
AtDB database 764,775,776 
Atto-Phos 606 
autoallotriploid hybrids 750 
autopolyploid 746 
auxotrophic mutants 333-4 
avian myeloblastosis virus reverse 
transcriptase 565 
avidin-fluorescein isothiocyanate see 
fluorescein isothiocyanate (FITC) 


B-cell mitogen 166 
Bacillus megaterium genome size 888
Bacillus subtilis genome sequence 710, 736
back-crosses 48 
bacterial artificial chromosomes (BACs) 
371 
rice physical map 805 
vector system 642 
bacterial infection 326 
bacterial interspersed mosaic element 
(BIMES) 729 
bacteriophage see phage 
Bal31 523 
deletion strategy 560 
exonuclease 558 
BamHI/BglII cosmid digest 445, 446
barr program 430, 432, 433 
BC1 progeny 786
BCR-ABL fusion gene 151 
BELL1 gene 780
bioinformatics 811 
biotin 218-19, 220, 228-30 
detection with FITC 235 
detection with Texas red 235-6 
DNA labelling by nick translation 
228-30 
quality control of labelling 230-1 
sequence labelling 602, 603 
biotin-11-dUTP, PCR product labelling 
296 
biotin-streptavidin system 551-2 
bis-acrylamide 580,581 
BLASTN program 624 
BLASTX program 771 
ble gene 337 
bleomycin- and phleomycin-binding 
protein 337 
blood bottle preparation 869 
blood cell transformation with 
Epstein-Barr virus 871
blood culture, unsynchronized 276
blood samples 
cryopreservation 275-6 
lymphocyte separation 274 
processing 168-9 
slide preparation 169-70 
blood tube processing 869-70 
Bloom’s syndrome 475-6 
bobbed (bb) mutation 673 
bone marrow culture, unsynchronized 276 
bone marrow samples 
cryopreservation 275-6 
synchronization technique 184-5 
Brassica genome 779 
BRCA1 gene 28, 29, 34
linkage 34 
BRCA2 gene 34 
BrdU see 5-bromo-2-deoxyuridine (BrdU) 
breast, phyllodes tumour 192 
breast cancer 
A allele 32
age 32 
family pedigrees 29,30 
G-banding in carcinoma 192 
inherited susceptibility 28 
liability classes 31,32 
linkage analysis 30-3 
linkage heterogeneity 34-5 
media for cytogenetics 196 
recombination fraction 31 
sarcoma 196 
segregation analyses 30 
5-bromo-2-deoxyuridine (BrdU) 216 
incorporation during early S-phase for 
replication G-banding 224-5 
incorporation during late S-phase for 
replication G-banding 223-4 
5-bromo-4-chloro-3-indolyl phosphate 
(BCIP) 603 
Bruton’s tyrosine kinase (Btk) 641 
BstXI site 447 
Btk gene 641 
bulbar muscular atrophy 111 
bulked segregant analysis 785 
bulletin boards 813, 814-15, 885 
Burkitt's lymphoma 165 


c7dATP 558 
c7dGTP 558 
C-banding see constitutive 
heterochromatin banding 
Caenorhabditis elegans 632, 688 
chromosome III 691, 692 
chromosome number 887 
cosmid clones 688 
DNA sequence 690 
expressed sequence tags 691 
genes 688 
genome 693 
map 688-9 
sequence 689-90 
size 888 
hybridization of individual YAC and 
cosmid clones 688 
mapping techniques 688-9 
physical map 690
restriction enzyme-based fingerprinting
688, 689 
sequence-tagged site (STS) assays 688 
sequencing 513,521 
project 690-3 
targeted gene disruption 691
transgenic reporter constructs 691 
YAC clones 688 
campomelic dysplasia 345 
cancer 
chromosome aberrations 977-84 
chromosome analysis 292 
gene therapy 656-9 
genetics and denaturing gradient gel 
electrophoresis 500 
minimal residual disease detection 147 
capillary blotting, sequence transfer 608 
capillary electrophoresis 585-7 
analysis time 586 
band width 586 
capillaries 586 
electric field conditions 587 
electro-osmotic flow 586 
gel matrix 586-7 
hydrophobic associations 587 
Joule heating 586 
polyacrylamide 587 
polymer entanglements 587 
sequencing gel 593-4 
carcinogenesis, rat model 296 
case-control tests 40 
CD4+ T cells 661
cDNA 
analysis in rice 
large scale 776-8, 779-81, 782-3 
strategy 777,782 
enrichment 443-4, 447-8, 458-62 
filters for direct cosmid/YAC
hybridization 456-7 
full-length 448 
hybridization 443,445 
direct 447 
identification in rice 778, 782-3 
ligation to vector 483-5 
minilibrary 443, 444 
probes 217 
rapid amplification 448, 462-6 
selection 320 
sequences 681 
map-based cloning of target genes in 
rice 807-8 
plant databases 811 
sequencing in Arabidopsis nuclear 
genome 386-90 
size fractionation 482-3 
cDNA library construction 482-3 
synthesis 480-2 
cDNA library construction 480-2 
from cytoplasmic RNA 452-4 
transient expression screening 472 
transfectant functional analysis 477, 
491-2 
walking 443,444 
cDNA clones 103 
enrichment factors 448 
functional adhesion assays 491-2
large insert 448
rice 776,778,783 
transient expression screening 471 
cDNA library 445 
construction 472-6 
cDNA size fractionation 482-3 
cDNA synthesis 480-2 
ligation of cDNA to vector 483-4 
poly(A)+ RNA preparation 479-80
protocol 478-86 
rice 776-8 
RNA isolation 478-9
vectors 472-6 
viral genome replication 473 
EBV-based episomal 475-6
functional adhesion assays on cloned 
cDNAs 491-2 
rice 
clone composition 782 
genetic linkage map 786-801 
redundancy analysis 782-3 
tissue specificity 778 
screening 
cell-surface proteins 486-91 
transient expression in mammalian 
cells 470-2 
YACs 645 
cell culture 325-6 
fusogens 326-7 
cell lines, unsynchronized cultures 277 
cell passaging 203-4 
cell-surface antigen identification 333 
cell-surface markers, fluorescent detection 
248 
cell-surface molecules 472 
cell-surface protein screening by panning 
and rescue 476, 486-91 
cell-synchronizing agents 265-6 
CENSOR program 624 
Centre d’Etude du Polymorphisme 
Humain (CEPH) 53, 54,55, 57,90 
genotyping errors 84 
maps 46,47, 48, 101 
cereal genome research 811-12 
CFTR gene 652 
CFTRdeltaF508 mutation 9 
CG-clamping 502 
CGSC database 735 
charge-coupled device (CCD) 195, 606 
charge-coupled device (CCD) camera 
Plate 8, 308-10 
colour 310 
dark current 309 
photonic noise 309 
pixel binning 309, 310 
quantum efficiency 309 
subarray sampling 309-10 
types 309 
chemiluminescent substrates 
alkaline phosphatase 604 
enhanced for horseradish peroxidase 
604-5 
sequence labelling 604-6 
Chi sequence, E. coli 728 
CHIAS program 748 
chiasmata distribution 7, 8
Chiasmatype Theory 7
children, cytogenetic analysis 158, 159 
chimaera program 432, 433 
chimaeric clones 370 
chimaerism, long-range physical map 
construction 432-3 
CHLC see Cooperative Human Linkage 
Centre (CHLC) 
chloramphenicol acetyl transferase (CAT) 
gene 674 
chloroplast DNA analysis 746 
chorionic somatotrophin, human genomic 
locus 626 
chorionic villus, unsynchronized cultures 
277 
chorionic villus sample 174-7
harvesting 153-4, 177-9 
prenatal diagnosis 153 
transport medium 153 
chromatin 
fibres 
FISH probe mapping 222 
release from interphase nuclei 215 
interphase nuclei 216 
release from interphase nuclei 225-6 
released 217 
chromomycin A3 dye 292, 293, 296, 297
chromosomal DNA 
denaturation 231-2 
hybridization 232-3 
chromosomal in situ suppression (CISS) 
218 
chromosome 6 microdissection 268 
chromosome 6q26-27 region 270 
library 271 
micro-FISH analysis of DNA Plate 7 
chromosome 21 cosmid contig Plate 8 
chromosome 
aberrations 292 
associated with cancer 977-84 
abnormality 
detection by chromosome painting 
Plate 6, 247-8 
flow sorting Plate 5
malignant cells 159 
analysis 147 
bivariate 296,297 
lymphocyte preparation 152, 168-71 
assignment for microcell-mediated 
chromosome transfer 339 
bar codes 243 
breakage 7 
committees 881-5 
cryopreservation 266 
direct preparations 198-9 
DOP-PCR amplification 253-5 
Drosophila melanogaster 669,670,671, 
676, 677-8 
Escherichia coli 725-9 
fixation 266 
flow cytometry 190 
flow sorting Plate 5, 147-8, 294, 299-300 
flow-sorted 
Alu-PCR amplification 252-3 
DOP-PCR amplification 253-5 
library 243-4
construction 296, 298, 302-3
markers 
isozymes 753 
RFLPs 753,754 
metaphase 215-16 
number of common species 887 
preparation 
for library construction 302-3 
magnesium sulphate method 301-2 
polyamine method for flow sorting 
299-300 
probes see chromosome paints 
prometaphase 152 
thymidine block synchronization 
171-2 
ring 157,159 
satellite 748 
segregation of somatic cell hybrids 328 
soup 170 
spreading 265 
walking 374 
see also metaphase chromosomes; 
microdissection 
chromosome 17 345 
chromosome banding 266 
karyotypic analysis 748-9 
meiotic 753 
microdissection 266 
techniques 154-6 
see also constitutive heterochromatin 
banding (C-banding); G-banding; 
quinacrine banding; R-banding 
chromosome painting 242-3, 295 
applications 247-8 
banding techniques 156 
chromosome abnormality detection 
Plate 6 
chromosome libraries 243-4 
chromosome-specific probes 217 
commercial probes 246-7 
competitive in situ suppression 
hybridization 242,247, 255-7 
DOP-PCR amplification 253-5 
forward 295 
hybridized labelled probe detection 
257-9 
interphase cytogenetics 248 
interspersed repetitive sequence-PCR 
345 
limitations 248-9 
malignant myeloid disorders 249 
microdissection 249 
multicolour 246-7, 248 
nick translation 250-2 
probe concentrations 247 
probes Plate 4, Plate 5 
resources 244-7
reverse 246,249, 295 
chromosome paints 147 
Alu-PCR 244-5 
DOP-PCR 245 
flow sorting of abnormal chromosomes 
246 
generation 298-9 
TRS-PCR 244-5 
microdissection and FISH 246
chromosome pairing 
cytoplasmic effects 751 
discrimination between parental 
chromosomes 752-3 
hybrid plants 749-51 
marked chromosomes 752 
mathematical models 753 
meiotic chromosome banding 753 
parental chromosomes 752-3
Ph1-regulated 751-2
chronic granulocytic leukaemia/chronic 
myeloid leukaemia (CGL/CML) 
160, 162, 163, 164 
chronic lymphoblastic leukaemia 162 
chronic lymphocytic leukaemia (CLL) 161 
symptoms 165 
chronic myeloblastic leukaemia t(9;22) 
translocation 151 
chronic myeloid leukaemia 164 
chronic myelomonocytic leukaemia 160, 
161 
Clarke-Carbon bank of E. coli 719,724 
CLODSCORE 57-60 
clone, picking 326 
cloned DNA mapping 325 
sequence 323 
cloning 
map-based 802-3, 807-9 
positional 369, 640 
rings 326 
clotting factors 655 
cluster homology regions, yeast 708-9 
Cnx1 protein 780-1 
codA gene 338 
codons 513 
colcemid 193 
Colibri database 735 
Collaborative Research (CRI) maps 101-2 
collagenase 191 
colorectal tumourigenesis 656 
denaturing gradient gel electrophoresis 
500 
colorimetric substrates, sequence labelling 
603-4 
command line interface 808 
commercial suppliers 873-6 
comparative genomic hybridization 
(CGH) 193-5, 196 
cameras 308 
solid tumour cytogenetics 189, 206-9 
competitive in situ suppression 
hybridization (CISSH) 242,247, 
255-7 
complementarity, local reverse 621, 623 
complex traits 28 
analysis 38-40 
genetic heterogeneity 33-5 
Mendelian trait with covariates 28, 29, 
30-3 
no clear mode of inheritance 35-7 
sampling problems 37-8 
comprehensive map 45 
flowchart algorithm for construction 92 
computer 
application programs 808 
filing system 809
mail lists 814
network 
connectivity 810 
node 810 
services 813-35 
user interface 808-10 
computing 
client-server 811 
hardware 808 
operating system 808 
consensus map for chromosome 1, 
MultiMap 94 
constitutive heterochromatin banding (C- 
banding) 155, 156, 159, 183-4 
karyotypic analysis 748-9 
contig 422 
directed walking for long-range physical 
map construction 426 
gap sequences 567 
primers 567 
shotgun sequencing 519 
contig assembly 
Drosophila melanogaster 678 
microcloned DNA 273 
Cooperative Human Linkage Centre 
(CHLC) 47,54, 57 
maps 101 
copia element of Drosophila melanogaster 
673, 678 
COS cells 470,472, 474 
cDNA introduction 476 
functional adhesion assays on cloned 
cDNAs 491-2 
panning 476-7 
transfectable 475 
cos sites 375 
cosmid 102-3, 217 
clones 
Caenorhabditis elegans 688 
host strain 376 
colour determination 313 
contigs 108 
CsCl gradient 533,535-6 
purification 546-7 
direct hybridization to cDNA filters 
456-7 
DNA 546-7 
fingerprinting in Drosophila melanogaster 
678 
long-range mapping 370 
QIAGEN plasmid kits 535 
screening by hybridization 377-8 
template amplification 533, 535-6 
cosmid libraries 370 
clone handling 376 
construction 397-401 
Drosophila melanogaster 677 
host 375 
insert DNA preparation 376 
long-range mapping 375-6 
rice physical map 805-6 
vector 375 
costig program 430 
Cot-1 DNA 247 
hybridization 443, 444 
CpG dinucleotide 444
CRI-MAP 47, 49, 52, 53-4
automatic map-building 81 
data set construction 62-3 
EUROGEM 85 
genotype file 51 
haplotype screening 84 
likelihood calculations 81 
locus file 63 
log10 (likelihoods) 81
MultiMap 91-2,93,94 
output 81 
parameter file 62, 64,81, 84 
protocol 64-80 
reference maps 61-82 
use 63-4, 81 
CROP algorithm 83 
crossability 747 
crossing-over 6,7 
double 7 
cryopreservation of blood and bone 
marrow samples 275-6 
cystic fibrosis 
chloride ion transporter (CFTR) protein 
650, 652 
expression vector 652-3 
denaturing gradient gel electrophoresis 
500 
gene therapy 652-3 
genetic markers 9 
genotypes 9 
linkage mapping 28 
phenotypes 9 
z, lod scores 14-16 
cytochalasin B 341 
cytogenetic analysis 147-8, 151-2 
applications 151-2 
approaches 167 
banding techniques 154-6 
cancer 159-61, 162, 163-4, 165, 166-7 
cell culture methodology 152-4 
constitutional abnormality detection 
156-9 
digital imaging 167-8
microdissection 273 
prenatal diagnosis 153-4 
procedure 161,163, 164, 165-6 
slide preparation 168-71 
whole blood sample processing 168-71 
cytogenetics 
nomenclature 161 
see also solid tumour cytogenetics 
cytokines 470,472 
tumour cell expression 658 
cytomegalovirus enhancer 474 
cytosine deaminase 338 
cytotoxic genes 657-8 


D1S8 locus, MVR-PCR 135, 140-2 
DA/DAPI staining 155 
dad1 gene 781 
DAPI 
digital camera imaging 310 
FISH 216 
fluorescence 314 
dark current 309 
data acquisition 812
database 735, 736
conceptual schema 812 
information sources 847 
organisms 844-6 
software 847 
subject indexes 844-7 
technology 811-12, 813 
data acquisition phase 812 
World Wide Web 847-61 
ddNTP see dideoxyribonucleoside 
triphosphate (ddNTP) 
dechimaerization inserts 432 
defective cell phenotype complementation 
470 
degenerate oligonucleotide primed PCR 
(DOP-PCR) 245-8, 280-1, 298 
amplification 270, 280-1 
microcloning of products 286-7 
chromosome painting 296 
flow-sorted chromosome amplification 
253-5 
microcloning of amplification products 
286-7 
microdissection 268-9, 270 
oligonucleotide primer 268-9 
denaturing gradient gel electrophoresis 
496-7 
applications 500 
band problems 508 
colorectal tumourigenesis 500 
constant 503 
DNA melting behaviour simulation 
498, 499 
DNA sequence analysis 496 
familial adenomatous polyposis 500, 
501 
GC-clamp 497-8, 499, 502 
genomic 502 
heteroduplex molecules 497,501,502 
homoduplex molecules 497 
molecular genetics 500,502 
mutation detection 500,503 
parallel 499 
perpendicular 498,500 
plant genome analysis 786 
protocol 503-8 
resolving power 496-7 
SSCP analysis 503 
two-dimensional DNA typing 502-3 
variants of method 502-3 
deoxycytidine deaminase 333 
deoxycytidine kinase 333 
deoxyinosine (dITP) 558 
deoxyribonucleoside triphosphate (dNTP) 
561,563, 564 
desynapsis 751 
DHFR gene 337-8
diabetes mellitus, insulin-dependent 
(IDDM) 641 
genetic mapping 82 
dideoxy sequencing 514, 558, 560 
filamentous phages 562 
phagemids 563 
plasmid sequencing vectors 563 
vectors 562-3 
dideoxynucleotides, dye-labelled 610
dideoxyribonucleoside triphosphate 
(ddNTP) 561,563, 564 
differential replication banding 
early and late 155 
see also pre-banding 
digital microscopy 
colour reproduction 314 
FISH 30 
fluorescent chromosome band 
enhancement 313 
hardcopy output 314 
image data storage 314 
laser scanning 311-12 
multiple probe detection 312-14 
digoxigenin 218-19, 220, 228-30 
detection with FITC 235-6 
DNA labelling by nick translation 
228-30 
quality control of labelling 230-1 
sequence detection 611 
sequence labelling 602,603 
digoxigenin-FITC Plate 9 
dihydrofolate reductase 337-8 
4th DIMENSION database 809-10 
2,4-dinitrophenyl (DNP) sequence 
labelling 602 
dinucleotide repeats 47, 108-9 
Généthon maps 101, 102 
loci 110-11 
typing 110-11, 119-21 
dioxetane substrates, chemiluminescent 
604 
diploid hybrids, plant genome analysis 
750 
direct transfer electrophoresis, sequence 
transfer 608 
disaggregation 
enzymatic 191, 197-8 
mechanical 190-1, 196-7 
disease gene 
LINKMAP 83 
mapping 81-2 
disease susceptibility allele 37 
disease-resistance gene tracking 779 
dispersed repeat element pre-association 
114-15 
distamycin A (DA) 155 
DNA 
extraction, from P1 clones 414-16 
fragment denaturing gradient 496 
gyrase binding in E. coli 719 
high molecular weight 
liquid preparation 380 
preparation in agarose 378-9 
isolation from agarose plugs 226, 227 
ligase 106-7 
melting behaviour simulation 498, 499 
mismatch repair genes 500 
packaging 370 
polymerase 500,566 
reduced fingerprint 134 
single-copy content of genome 888-9 
single-stranded template 569-70 
universal amplification 269 
see also dideoxy sequencing; insert DNA; 
sequence; sequencing
DNA damage 265
acid-induced 266 
mitotic spindle inhibitors 266 
sources 265-6 
synchronizing agents 265-6 
DNA fingerprinting 3, 113, 128, 129, 130 
data interpretation 132 
linkage analysis 131 
multilocus 131 
practice 132 
probes 131 
protocol 136-7 
statistics 132 
DNA markers 
mapping 325 
mouse genome mapping 635 
rice linkage analysis 803,804 
rice physical map 806, 807 
DNA polymorphisms 99 
A residues 110
applications 99 
CEPH maps 101 
CHLC maps 101 
classes 99 
cosmids 102-3 
defined clone 102-3 
dispersed repeat element pre- 
association 114-15 
EUROGEM maps 101 
finding 100-2 
Généthon maps 101 
identifying new 102-12 
informativeness 99-100 
large collections 107-12 
map placement 100, 100-1 
multilocus methodology 113 
screening for 110 
sequence analysis 104-6 
sequenced genes 103
Southern blot hybridization 102, 103-4
SSCP analysis 103, 104, 105, 115-17 
tandem repeat variability 108-9 
YACs 102-3 
DNA probes 
denaturation 232-3 
detection of hybridized 234-7, 257-9 
mapping/ordering by FISH Plate 3
microdissected region-specific 282-6 
plant genome analysis 754 
post-hybridization washes 233-4 
prehybridization 232-3 
rice genetic linkage map 786-801 
DNA typing 128 
allele frequencies 130 
anonymous coding 131 
cell line identity 128 
confidentiality 131 
DNA sample quality control 128 
family relationship verification 128 
hypervariable minisatellite loci 132 
locus-specific minisatellite probes 
132-4 
paternity analysis 128, 130 
statistical evaluation 129-30
systems 128-9 
DNASTAR program 733
dNTP see deoxyribonucleoside 
triphosphate (dNTP) 
DOMAINER program 771 
dominant selectable markers 334-5 
eukaryotic expression vectors 335-6 
negative selection 338 
positive selection 335-8
positive-negative bidirectional selection
338
promoter 334 
selection schemes 335-8 
transfer into mammalian cells 334 
DOP-PCR amplification see degenerate 
oligonucleotide primed PCR 
(DOP-PCR) 
double minutes 189 
alveolar rhabdomyosarcoma 195 
double-cos vectors 375 
doubled haploid lines 786 
Drosophila Genome Centre mapping 
project 674, 679-82, 683 
in situ hybridization mapping 680 
Drosophila melanogaster 632,668 
Bridges' map 670, 671-2
cDNA sequences 681 
chromosome number 887 
clone ordering by in situ hybridization 
674, 675 
contig assembly 678 
copia elements 673,678 
cosmid clone availability 679 
cosmid fingerprinting 678 
cosmid library construction 677 
cytogenetic mapping 669, 671-2 
cytogenetics 669, 670, 671-2 
Duncan map 676 
euchromatin 669, 671
European Consortium cosmid map 674,
676-9, 683
evolutionary relationships 672 
FlyBase database 681, 682 
foldback DNA 673 
genetic mapping 668-9
genome sequencing 682 
genome size 888 
genome structure 672-3 
Hartl map 676 
heterochromatin 669 
histone genes 673 
large-scale sequencing 681
long-terminal inverted repeat elements
673
mapping projects 674-83 
mitotic chromosomes 669, 670 
model system 668-9, 670, 671-4 
molecular genetics 672-4 
mutations 668-9 
P1 clone availability 681-2 
P1 clone library 676
P1 library 680
P element 673-4, 680-1
insertions 679 
P element-mediated germline 
transformation 674
polytene chromosomes 669, 670, 671-2,
676, 677-8
ribosomal RNA genes 672-3
satellite DNA 672 
sequence tagged sites (STSs) 678-9, 
680-1 
transposable elements 673 
transposon tagging 674 
YAC maps 674, 675-6 
Drosophila pseudoobscura 683 
Drosophila viridis 682-3 
dwarf gene, cereals 811-12 
dystrophin gene 10 


E. coli F factor 371 
e-mail 812,814 
EBNA1 replication protein 475
ebnavirus 473 
ECO2DBASE 736 
EcoCyc database 735 
EcoGene database 735 
EcoMap database 735 
EcoSeq database 735 
electro-osmotic flow 586 
electrophoresis 578-9 
band width 578,579 
band-broadening 578 
band-spacing 578,580 
capillary gel 585-7, 593 
diffusion 578,579, 580 
direct blotting gel 579, 590-2 
direct transfer 584-5
gel for automated sequencing 592-3 
Joule heating 578,579, 580 
limiting factors 580 
molecular orientation 578,580 
pulsed electric fields 579 
resolution 578 
sequence transfer 608 
slab gel 579-85 
standard sequencing gel 588-90 
Encyclopedia of the Mouse Genome 
639-40
enhancer trap, Arabidopsis nuclear genome
769
enterobacterial repetitive intergenic 
consensus (ERIC) sequence 728 
enucleation 357 
from plastic bullets 351-3 
Percoll gradient 353-4 
epifluorescence microscopy 306-11 
arc lamp 306-7 
CCD camera 308-10 
digital cameras 307 
digital image 307
electronic cameras 307-8 
image digitization 311 
Köhler illumination optics 306, 307
linear filter 313 
multiple fluorochrome imaging 
310-11 
multiple probe detection 313 
objective lens 307 
ratio labelling 313 
silicon intensified camera 308 
video cameras 307,308 
Epstein-Barr virus 152, 473, 660 
transformation of blood cells 871
vector for cDNA library construction
473, 474 
Escherichia coli 632,717 
Arabidopsis genes cloned 767 
base composition 726 
Chi sequence 728
chromosome 720,725-9 
number 887 
replication 726 
circular chromosome 718 
Clarke-Carbon bank 719, 721
codon usage 726 
Colibri 735 
complementation 724 
complementing fragment direct 
identification 724 
conjugation 723-4 
followed by complementation 724 
conserved sequences 738 
contigs 725 
editing 733 
cotransduction frequency 724 
CTAG sequence 726 
databases 735-6 
DNA 719,720,723 
gyrase binding 719 
replication 726 
sequence 725-6 
DNA polymerase I 561 
binding 719 
EcoCyc 735
EcoGene 735 
Ecomap5 721 
EcoMap 735 
EcoSeq 735
G+C content 725, 734 
gene 
analysis 733-4 
annotation 733-4 
arrangement 725, 726-7 
order conservation 727 
products 726 
rearrangement 727 
Gene Database 736 
Gene-protein Database 736 
genetic mapping 724-5 
Genetic Stock Centre (CGSC) 735 
genome 718-19, 720 
nomenclature 717 
segmentation 734 
sequence 710 
sequencing projects 729-34 
size 888 
gray holes 734 
Harvard University genome sequencing 
730-1 
Hfr strains 723 
I-Scel digestion 734 
insertion sequences 728 
interspersed repeats 727-8 
inversions 727 
K-12 718, 737-8 
Kobe University genome sequencing 
730 
Kohara clone library 719,721 
Kohara map 724
lambda clones 719 
lambdoid phages 728 
life cycle 717 
linkage map 722 
Moco (molybdenum cofactor) mutant
781 
model system 717-18 
open reading frames 725,733 
physical map 719, 721-2, 722 
methods dependent on 724-5 
plasmids 719 
potential gene identification in 
provisional sequence 733 
promoter location 733 
random clones for sequencing 731 
REP sequences 719 
repeated sequences 727-9 
replication 719 
replicons 719 
reverse genetics 724-5 
Rhs elements 728 
ribosomal protein 726 
ribosomal RNA 726 
sequence similarity 725 
sequencing process 732 
short multicopy palindromic repeats 
728-9 
shotgun preparation 731 
SWISS-PROT 736 
termination 719 
terminus region 726 
Tokyo University genome sequencing 
730 
transcriptional unit orientation 727 
transduction 723,724 
transposable element Tn10 724 
transposon transmission 728 
Wisconsin University group genome 
sequencing 730, 731-4
essential thrombocythaemia 166 
ethernet 810 
ethidium bromide 296 
ethotrexate 337 
euchromatin, Drosophila melanogaster 669, 
671 
EUCIB mouse-backcross map 425-6 
eukaryotic expression vectors 335-6 
eukaryotic genes, functional analysis 631 
EUROFAN 711 
EUROGEM maps 44, 45, 46, 47,101 
European Collaborative Interspecific Back- 
cross (EUCIB) programme 638, 640 
European Consortium cosmid map 674, 
676-9, 683 
European Gene Mapping Project 
(EUROGEM) 85 
European Scientists Sequencing 
Arabidopsis (ESSA) project 770,771, 
774-5, 778 
Ewing’s sarcoma, media for cytogenetics 
196 
exon amplification 443-4
artefacts 446 
efficiency 445-6 
specificity 445-6 
exon DNA cloning 454-5
exon trapping 320, 442-3
chimaeric exons 446 
protocol modification 447 
pSPL1 445-7, 449-56 
exonuclease III 523, 524, 536, 558
expectation maximization (EM) algorithm 
49,92 
expressed sequence tags (ESTs) 319, 514 
Arabidopsis 
genetic maps 765
nuclear genome 769-73 
Caenorhabditis elegans 691 
mapping 320 
rice 776 
cDNA clones 783 
marker for genetic linkage map 802 
extracellular proteins, transient expression 
screening 477 


F2 progeny 786 
factor IX 655 
familial adenomatous polyposis (FAP) 
APC gene 503 
mutations 500, 501 
de novo mutations 10 
linkage mapping 28 
mutation penetrance 10 
phenotype variation 9-10 
family 
genetic map data 48-50 
inbred 38 
loops 38 
structure 23 
Fanconi’s anaemia 476 
FASTMAP program 49 
fibre distributed data interface (FDDI) 
810 
file transfer (ftp) 815-33 
FITC see fluorescein isothiocyanate (FITC) 
FKHR gene Plate 2 
flow cytometer, dual-laser 296 
flow cytometry 292 
chromosomes 190 
type differentiation 294 
flow karyotype 292 
bivariate 293 
human fibroblast chromosomes 294 
variations 292-3 
flow karyotyping of chromosomes 298 
flow sorting 108 
abnormal chromosomes Plate 5 
chromosome preparation 
magnesium sulphate method 301 
polyamine method 299-300 
with PCR 294 
see also fluorescence-activated flow 
sorting 
flow-sorted chromosomes 
Alu-PCR amplification 252-3 
DOP-PCR amplification 253-5 
fluorescein, sequence labelling 602 
fluorescein isothiocyanate (FITC) 215 
biotin detection 235 
digital camera imaging 310 
digoxigenin detection 235-6 
PCR product signal development 296
fluorescence
banding techniques 313 
plus Giemsa method 216 
fluorescence in situ hybridization (FISH) 
147,215 
alpha satellite probes 217-18 
Alu-PCR 227-8 
applications 215 
banding 216 
biotin 218-19, 220 
labelling 228-31 
biotin-labelled probes 243 
BrdU incorporation 223-5 
chromatin release 222, 225-6 
chromosomal DNA 
denaturation 231-2 
hybridization 232-3 
chromosomal in situ suppression (CISS) 
218 
chromosome microdissection 
combination 242 
digital microscopy 30 
digoxigenin 218-19, 220 
labelling 228-31 
digoxigenin-labelled probes 243 
DNA 
isolation from agarose plugs 226, 227 
labelling 228-30 
probe mapping/ordering Plate 3 
fluorescence plus Giemsa method 216 
fluorochromes 877-8 
hybridized probe detection 234-7 
interphase 195 
analysis Plate 2,189 
cells 248 
nuclei 216-17 
long-range physical map positional 
information 424 
metaphase chromosomes 215-16 
microdissected region-specific probes 
282-6 
microdissection 246, 282-6 
monosomy 7 in leukaemia 250 
multicolour techniques Plate 9, 248 
multiple probe detection 312-14 
nick translation 219, 228-30, 250-2
pepsin pretreatment 219-20 
plant genome analysis 754,755 
post-hybridization washes 233-4 
pre-banding 222 
probe 217-18 
detection 220 
DNA denaturation/pre-hybridization
232-3 
labelling 218-19 
mapping 220-2 
signal analysis 220-2 
suppression 218 
proteinase K pretreatment 219-20 
R-banding 216 
released chromatin 217 
replication G-banding 223-4 
replication R-banding 224-5 
slide preparation 215-17 
solid tumour cytogenetics 189, 193-5 
somatic cell hybrid characterization 344
TRAP gene mapping 221 
whole chromosome painting probes 
Plate 4, Plate 5 
YACs 218 
see also chromosome painting 
fluorescence-activated flow sorting 292-6 
analysis 242 
chromosome 
library construction 298 
paint generation 298-9 
preparation 298, 299-302 
fluorochromes 292, 296 
instrumentation 296-7 
see also flow sorting 
fluorigenic substrates, sequence labelling 
606 
fluorochrome-conjugated nucleotides 218, 
219 
fluorochromes 147,215, 877-8 
fluorescence-activated flow sorting 292, 
296 
imaging with digital camera 310-11 
laser scanning microscopy 312 
multiple 313 
probe labelling 219 
5-fluorocytosine/cytosine deaminase 338 
FlyBase database 681, 682 
folate deficiency 152 
foldback DNA, Drosophila melanogaster 
673 
forensic practice 
application of DNA typing 128 
minisatellite loci band-shift effect 133 
form filling 808 
fragile X syndrome 10 
cytogenetic analysis 159 
detection 152-3, 172-3 
trinucleotide repeats 111 
framework map 45 
flowchart algorithm for construction 91 
fungal infection 326 
fusogens 324, 326-7 


G418 sulphate/neomycin 
phosphotransferase II 335-6 
G418-resistance gene 675 
G-11 banding 155 
G-banding 152, 154-6 
breast carcinoma 192 
FISH 223-4 
karyotypic analysis of somatic 
chromosomes 748-9 
microdissection 266 
protocol 179-80 
restriction endonuclease 155 
Gal4 promoter 674 
GAMA medium 333 
ganciclovir 658 
ganciclovir/herpes simplex virus 
thymidine kinase 338 
GC-clamp 497-8, 499 
gel digest fingerprinting 422, 423 
gel electrophoresis, sequencing 572 
gene 
expression regulation 631 
function 631 
fusions 147
mapping with microcloned DNA 273 
tagging for map-based cloning 802-3 
transfer 
accuracy 661-2 
adenoviral vectors 662 
gene isolation from genomic DNA 442-5 
cDNA enrichment 443-4, 447-8, 458-62 
coding sequence isolation 444-5 
CpG islands 444
direct cDNA hybridization 447 
direct cosmid/YAC hybridization to 
cDNA filters 456-7 
exon amplification 443-4 
exon trapping 442-3 
by pSPL1 445-7, 449-56 
5’ RACE 464-6
rapid amplification of 3’ ends 462-4
gene therapy 650 
ADA deficiency 662, 663 
AIDS 652, 660-1 
cancer 656-9 
candidate disease 650-2 
corrective 657 
cystic fibrosis 652-3 
cytotoxic 657-8 
delivery 652-3 
systems 661-3 
vehicle 651 
ethics 650 
germline 650 
heritable potential 650 
HIV infection 660-1 
immunotherapy 657, 658-9 
infectious disease 659-61 
monogenic disorders 650-1, 652-6 
multifactorial genetic disorders 656-9 
physiological defect correction 651, 652 
regulation of expression 651 
severe combined immune deficiency 
(SCID) 653-5 
somatic 650 
target cells 651-2 
thalassaemia 654, 655 
vector delivery 661 
vectors 652-3, 653-4 
viral infection 659-61 
GENE/COMBIS electronic journal 805 
Généthon human genome map 101, 102,
425 
markers 101 
genetic character segregation 
meiotic 6-7 
non-independent 6 
genetic disorders 
chromosomal changes 160 
multifactorial 656-9 
translocation 147 
genetic distance measuring 6-7 
genetic mapping 
Drosophila melanogaster 668-9 
heuristics 90 
genetic maps 44-7
disease mapping 81-2 
family data 48-50 
high-density linkage 3-4
integration with physical 85
interference 50 
likelihood 48 
locus distance 44 
lod scores 49 
marker position 49 
MultiMap 90-5 
multiple two-point analyses 49-50 
multipoint 7, 52
order inference 47-8 
polymorphisms 47 
probability computing 48-50 
protocols 50-84 
recombination information 47-8 
sex differences 44 
shared resources 46 
types 45 
genetic markers 8-9 
genetic polymorphisms 3, 99
genetic recombination 7 
genetic variation 3 
genome 
comparison 631 
definition 746 
mapping 
Diptera 682-3 
random clone anchoring 433-8 
mismatch scanning 113 
scanning of E. coli K-12 737 
sequence of Caenorhabditis elegans 
689-90 
size 888-9 
see also human genome map 
Genome Data Base (GDB) 57, 812
allele frequency 136 
genome resource centres 846-7 
World Wide Web 847-61 
genome-restructuring genes 755 
genomic DNA 
cloning 447 
exon trapping by pSPL1 449-56 
labelling by nick translation 206-9 
preparation 372 
production of clonable 383-4 
genomic hybridization analysis 
alveolar rhabdomyosarcoma Plate 1 
see also comparative genomic 
hybridization (CGH) 
genomic in situ hybridization, plant 
genome analysis 754, 755
genotoxicity assays 295 
genotype 9-10 
genotyping errors 93 
GenProTec database 736 
GenQuest server 624 
germ cell tumours, media for cytogenetics 
196 
germline 
mutation rate 133 
P element-mediated transformation 
674 
gestation, cytogenetic analysis 156-7, 159 
Giemsa banding see G-banding 
gliadin profiles 753 
global repeats 621, 624
inverted 621-3
β-globin chains 654, 655
globin gene family transcriptional control 
655 
Glrb gene 640 
glycerol kinase deficiency gene 445 
Gopher 816, 833
gpt gene 336, 338
granulocyte-macrophage colony- 
stimulating factor (GM-CSF) 472 
graphical user interfaces (GUIs) 808, 809, 
811, 812
grasses 
classification 747 
synteny 811-12 
gray holes, E. coli 734 
growth hormone, human genomic locus 
626 


haematological malignancy 160, 162
genes in translocations/inversions 162
symptoms 164-6
haemophilia, denaturing gradient gel
electrophoresis 500
haemophilia A 22
haemophilia B 655
Haemophilus influenzae
genome sequence 736, 737
genome size 888
hairy-cell leukaemia (HCL) 165
haplome 746
haplotypes
analysis 21-2
construction 24
hapten, probe labelling 218
HAT medium preparation 346-7
HAT selection 332-3
head and neck tumour, media for
cytogenetics 196
hepatitis B 660
hepatocellular carcinoma 660
hereditary non-polyposis colon cancer
(HNPCC) 10
herpes simplex virus thymidine kinase
gene (HSVtk) 338, 658, 659, 660-1
heterochromatin 669, 671
heteroduplex formation 112
heterogeneity 33-5
heterokaryons 324
heterozygosity 99-100
heuristics
genetic mapping 90
probe ordering 429-30
HgaI 107
HinfI fragment 104, 105
hisD gene 337
histidinol/histidinol dehydrogenase 337
histone genes 673
HIV infection, gene therapy 652, 660-1
HLA-DQ-locus 128
Hodgkin’s disease 165
Hoechst 33258 dye 292, 293, 296, 297
homeobox genes 780
homogeneously staining regions 189
homologous recombination 334
Hordeum vulgare, banding analysis 749
horseradish peroxidase 
enhanced chemiluminescent substrates
604-5 
sequence labelling 603 
host, computer network node 810 
hph gene 336 
HPRT gene 338 
hsp70 promoter 675 
HUGO chromosome committees 881-5 
human chromosome 
endogenous selection genes 329-31 
size 887 
human disease genes 879, 880-922 
positional cloning 325 
yeast gene similarities 710 
human gene mapping 8-16 
de novo mutations 10 
family data 9 
family studies 8-10 
genetic markers 8-9
genotypes 9-10 
linkage analysis 10-11 
linkage maps 8 
penetrance 10 
phenocopies 10 
phenotypes 9-10 
human genome 513-16 
morbid anatomy 923-42 
Human Genome Data Base (GDB) 90 
human genome map 45 
Généthon 101, 102, 425
Human Genome Mapping Project 51 
Human Genome Project, model organisms 
634 
HUMGHCSA genomic region 626-7 
Huntington’s disease 10 
gene 442 
linkage mapping 28 
trinucleotide repeats 111 
hybrid phenotype mapping 323 
hybridization 
locus-specific probes 103 
multiple-copy probe 422-3, 424, 425 
phenotypic changes 338 
sequence detection 600, 611-12 
sequencing by (SBH) 568-9, 573 
single-copy probe 422, 423-4, 425
see also comparative genomic 
hybridization (CGH) 
hygromycin B/hygromycin B kinase 336 
hyperekplexia 641 
hypertext markup language (HTML) 834
hypoxanthine phosphoribosyltransferase 
382-3 


icons 809, 810
identity-by-descent 36 
identity-by-state 36 
IGD/X-PED system 53, 54-6, 57, 58-60 
chromosome screen 57 
data management 57 
disease gene pinpointing 83 
image analysis 306 
immunomodulatory gene 
delivery 657, 658
expression 657 
immunotherapy, gene therapy 657, 658-9 
Imperial Cancer Research Fund, WWW
server 817 
in situ hybridization 
plant genome analysis 754 
see also fluorescence in situ 
hybridization; genomic in situ 
hybridization 
inbreeding loops 38 
inclusive map 45, 92
infectious disease, gene therapy 659-61 
infertility 159 
inner product mapping 425-6 
insert DNA 
phosphatase treatment of partial digests 
404 
preparation for cosmid libraries 376 
Integrated Genomic Database (IGD) 47 
project 52-3 
public data 56-7 
Integrated Services Digital Network 
(ISDN) 810 
integration of genes into chromosomes 
334 
integrins, host cell 471-2 
intensified silicon intensified target 
camera 308 
interference 7, 8, 50
intergenic repeat unit (IRU) sequences 728 
intergenomic affinity, plant genome 
analysis 750 
internal repeat recognition 624-5 
Internet 810 
commercial service provision 810 
resources 805 
interphase cytogenetics 248 
interphase nuclei 216-17 
chromatin release 225-6 
detection rate 248 
ordering of FISH probes 221-2 
structural chromosomal abnormalities 
248 
interspecific hybrids 324 
interspersed repeat sequence PCR (IRS- 
PCR) 242, 244-5, 344-5 
EUCIB mouse-backcross map 425 
high-resolution genetic mapping 638 
markers 644 
seed contig extending 644-5 
segregation analysis in mouse 635 
YAC clones 374-5 
interspersed repeat sequences, physical 
maps 644-5 
intracellular proteins 
screening by in situ labelling 477 
transient expression screening 471 
intron-exon structure 103 
introns, yeast 701-2 
irradiation and fusion gene transfer (IFGT) 
324, 325, 342-4 
fusion 343 
selection 342-3 
irradiation fusion hybrids 319 
IS elements, E. coli 728 
isozymes 753 


JANET 810, 811 


Japanese Rice Genome Research Program 
776 
JOINMAP program 764 
journals 885 
electronic 805 


kanamycin resistance 375 
karyoplasts 341-2 
karyotypic analysis 
chromosome banding 748-9 
computer-aided 748 
conventionally stained 
somatic/pachytene chromosomes 
747-9 
Giemsa-banded somatic chromosomes 
748-9 
mitotic 748 
pachytene 748 
satellited (SAT) chromosomes 748 
karyotyping 167-8 
kinetochore staining 155 
Klenow fragment 558, 560, 566
enzymatic DNA sequencing 563-4, 565 
sequence analysis of PCR products 571 
knock-out tables 943, 944-56 
Kohara clone library of E. coli 719, 721,
724 
Kohara map of E. coli 724 
Kohler illumination optics 306, 307 


label multiplexing 606 
lacZ gene 674 
lacZ promoter 562 
laser scanning microscopy 311-12 
confocal 312 
laser scanning technology 306 
leukaemia 
acute 164 
monosomy 7 250 
see also acute and chronic leukaemias 
liability classes 30 
breast cancer 31, 32
ligation, template-directed 568 
light gene 669 
light-signalling pathway 780 
likelihood 48 
likelihood function, CRI-MAP 61-2 
linkage 6 
heterogeneity 33, 34-5 
linkage analysis 
affected individuals 32-3 
age 32 
breast cancer 30-3 
efficiency 39 
human 10-11 
IGD/X-PED system 45-6, 54-6 
lod scores 11-12 
phenotypic heterogeneity 34 
polymorphic markers 39 
recombination fraction 31, 33
risk of disease 33 
unaffected individuals 32-3 
linkage map 
consensus 93 
construction 90 
human 8 
linkage markers
not segregating in Mendelian pattern 
23-4 
segregation with disease 24 
LINKAGE program 11, 12, 17, 22
alleles 39 
disease mapping 82, 83 
files 50 
input files 57-61 
liability classes 30 
probands 38 
linkage studies 
haplotype analysis 21-2 
human 17, 21-2
markers 17, 21
mode of inheritance 37, 38
number of families 37-8 
linker-adaptor PCR (LA-PCR) 244, 269 
LINKMAP disease mapping 82, 83 
LIPED program 11, 17
Lipofectin 192, 202-3
LISP interpreter 93, 94
local area networks 810, 811
locus control region (LCR) 655 
locus distance 24, 44 
lod scores 11-12 
breast cancer 34, 35
calculation 12-16 
CRI-MAP 62 
data collection 17, 18, 19-20, 21-2
LINKAGE input file creation 60-1 
mapping 49 
phase-known vs. phase-unknown 
linkage data 17 
tables 17, 18-21 
long interspersed repeat elements 345 
long-range physical map construction 369, 
422 
chimaeric clone detection 430, 432-3 
cloning system 369-71 
competition of probe to remove 
repetitive sequences 416-17 
computational requirements 426-7 
cosmid libraries 375-6 
construction 397-401 
cosmids 370 
data 422-4 
entry 426 
visualization 427 
error checking 426 
experimental strategy 426 
feedback 426 
fitting clones to probe order 430 
genetic and physical information 
integration 425-6 
genomic DNA preparation 372 
high molecular weight DNA preparation 
378-80 
human chromosome 422 
hybridization 377-8, 417-18 
matrix 422 
inner product mapping 424-5 
least distant neighbour 429, 430 
library resources 369 
map forking 429-30 
most distant neighbour 429, 430 
multiple-copy probes 422-3, 425
neighbourhood rules 429, 430
P1 clones 369, 370-1
DNA extraction 414-16 
P1 library construction 376-7, 401-14 
positional information 424, 427
probe 
contig 427 
ordering 427-9 
random noise detection 430, 432-3 
screening by hybridization 377-8 
single-copy probe 422, 425
data 427-30 
somatic cell hybrids 369 
washing 417-18 
YAC 369, 370
agarose block preparation 373, 391-3 
DNA partial restriction digest 
mapping 394-5 
filter lifts 389-91 
generation of end-specific probes from 
clones 395-7 
library construction 372, 381-8 
size fractionation by PFGE 393-4 
long-terminal inverted repeat elements, 
Drosophila melanogaster 673 
Lophopyrum elongatum 
C-banding 749 
chromosome pairing 751, 752
lung tumours, media for cytogenetics 196 
lymphoblastoid cell lines, flow karyotypes 
294 
lymphocyte 
mitogens 152 
separation of peripheral blood samples 
274 
sterile separations 870 
transformation 870 
lymphokine-activated killer (LAK) cells 
658 
lymphoma 160 
chromosome abnormalities 981 
lymphoproliferative disorders 
chromosome changes 982 
chronic 160 


M13 DNA 
cloning vector 561-2 
detergent extraction 542-3 
magnetic bead purification 543-4 
M13 template 533, 534-5, 558 
detergent extraction method 533, 534,
542-3 
magnetic bead purification 533, 534-5, 
543-4 
silica gel membrane purification 533, 
535, 545, 545
standard PEG-phenol method for DNA
recovery 533, 534, 541
magnesium sulphate, chromosome 
preparation for flow sorting 298, 
301-2 
mail lists 814 
malignancy, cytogenetic analysis 159 
malignant cell 
chromosomal abnormalities 159 
transformation 656
malignant hyperthermia susceptibility 22 
map distance 8 
map placement, DNA polymorphisms 
100-1 
map-based cloning 802-3 
target genes 807-9 
MAPMAKER 
mouse genome mapping 637 
rice genetic linkage map 803 
rice genome mapping 785 
mapping function 7-8 
marker alleles 39-40 
marriage loops 38 
Massachusetts Institute of Technology 
(MIT) microsatellite map 637, 640
materials 863-7 
matrix-assisted laser desorption 573 
Maxam-Gilbert sequencing method 
559-61 
Maximum Likelihood Estimate 48,49 
Maximum Likelihood Order 48 
MBx database 640, 643 
media 863-7 
meiosis 
gene exchange 6 
phase-known 17 
phase-unknown 17 
melting domain 498, 499 
genomic denaturing gradient gel 
electrophoresis 502 
meltmap of DNA fragment 498, 499
Mendelian trait with covariates 28, 29, 
30-3 
Menkes disease gene 442 
menus 808 
mesenchymal tumours, media for 
cytogenetics 196 
met oncogene 333 
metaphase chromosomes 
localization of FISH probes 220 
ordering of FISH probes 221 
methotrexate/dihydrofolate reductase
337-8 
micro-FISH analysis 270, 273
chromosome region 6q26-27 DNA
Plate 7
microcell hybrids 324, 325
applications 323 
microcell-mediated chromosome transfer 
(MMCT) 324-5, 339-42 
colcemid conditions for micronucleation 
341 
filtration 341-2 
fusion 342 
human monochromosomal hybrids 334, 
685: 
karyoplasts 341-2 
microcell isolation/enucleation 341 
micronucleation 340-1 
timetable 340 
microcells 
filtration 354-5 
fusion to whole recipient cells 355-7 
microclones 
characterization 271-3 
chromosomal regional specificity 272-3
frequency determination of repetitive 
and unique 272 
human origin confirmation 272-3 
insert 
colony PCR 287-8 
isolation by colony PCR 287-8 
size 271 
isolation of potential candidate genes in 
disease 273 
library construction 270-3 
redundancy level 272 
Southern blot analysis 272 
microcloning 273 
chromosome harvesting 277-8 
DOP-PCR amplification products 286-7 
microdissection 268-9 
microdeletion syndromes 242 
microdeletions 158 
microdissection 148, 265 
amplification reaction 270 
applications 265, 273
cell culture techniques 266 
chromosome 
harvesting 277-8 
preparation 265-6 
chromosome 6 268 
colony PCR 287-8 
contamination 269-70 
cryopreservation 275-6 
cytogenetic analysis 273 
DOP-PCR amplification reaction 280-1 
equipment 267 
FISH 246, 282-6 
libraries 108 
lymphocyte separation of peripheral 
blood samples 274
micro-FISH analysis 270 
microcloning 268-9, 286-7 
microneedle preparation 267, 278-9
microscopic 265, 266-8 
plant genome analysis 755 
region-specific libraries 273 
technique 279-80 
translocation breakpoints 273 
universal DNA amplification 269 
unsynchronized cultures 276-7 
micronucleation 356-7 
optimization 350 
microsatellite map 
MIT 637, 640
mouse genome mapping 637, 640 
polygenic loci 640-1 
microsatellites 
repeat polymorphisms 46-7 
see also short tandem repeats 
Miller-Dieker syndrome 242 
minisatellite variant repeat PCR (MVR- 
PCR) 129, 130, 134 
data interpretation 135 
protocol 140-2 
minisatellites 
allele drop-out 134 
allele frequencies 133-4 
band-shift effect 133 
duplication 110 
hybridization screening 112
hypervariable loci 132 
locus 100 
properties 132 
typing 112, 128-9, 133-4 
locus-specific probes 132-4 
protocol 138-40 
match criteria 133 
PCR analysis 129, 134-5 
probes for screening 112 
single locus typing 128-9 
variability 112 
minisequencing, solid-phase 107 
mitochondrial DNA typing 128 
mitogens 166, 167 
mitotic karyotyping 748 
mitotic spindle inhibitors 266 
MLH1 gene 500 
modem 810 
mogA 781 
monogenic disorders 650-1 
complex 655-6 
mononucleotide repeats 110 
monosomy 7 in leukaemia 250 
mouse gene knock-out tables 943 
double knockouts 955-6 
targeted mutations 944-55 
mouse genetic map 
comparative with human genome 641 
polygenic loci 640-1 
using 640-1 
Mouse Genome Database 639-40 
mouse genome mapping 634 
accessing information 639-40 
back-cross analysis 636-7 
candidate gene confirmation by 
positional cloning 645 
current status of map 637-8 
DNA markers 635 
high-resolution map 638-9 
interspecific/intersubspecific genetic 
cross 635-6, 637 
methodologies 635-7 
microsatellite map 637, 640 
nomenclature 640 
physical 641-5 
clone resources 641-2 
contig closure 644 
databases 643 
high-resolution genetic maps 643-4 
interspersed repeat sequences 644-5 
uses 645 
publications 640 
recombinant inbred strains 637 
strain distribution pattern 637 
strategies 634-41 
mouse mutation 634 
candidate gene identification 640 
mapping as prelude to positional 
cloning 640 
MSH2 gene 500 
multicomponent complexes, transient 
expression screening 471-2 
multidrug resistant protein (MDR-1) 659 
multifluorochrome techniques 306 
MultiMap 90-5
consensus map for chromosome 1 94
CRI-MAP 91-2, 93, 94
documentation 94
LISP interpreter 93, 94
locus markers 91 
mailing list 95 
software 94 
troubleshooting 95 
uses 92-3 
multiple myeloma 166 
multiplex sequencing 524-6 
chemiluminescence reaction 525 
hapten labels 526 
probe labelling 524-5 
sequence detection 611-12 
single vector 
with multiplex probe labelling 525-6 
with multiplex tagged primers 525 
streptavidin bridge 525 
tagged vectors 525, 526
multiplex tagged vectors 525
Mus castaneus 635, 637
Mus spretus 635, 636, 638 
mutagenesis, rat model 296 
mutation detection 496 
denaturing gradient gel electrophoresis 
500, 503-8, 503 
mutations, de novo 10 
MutHLS mismatch repair proteins 113 
myb oncogenes 780 
MycDB 835 
Mycobacterium database 835 
mycophenolic acid/xanthine 
phosphoribosyltransferase 336 
mycoplasma 326 
Mycoplasma genitalium genome sequence 
736, 738 
mycosis fungoides 165 
myelodysplastic syndrome 160, 163 
chromosome changes 980 
chromosome painting 249 
classification 980 
symptoms 165 
myeloproliferative disorders 160 
chromosome changes 980 
classification 981 
myotonic dystrophy 10 
trinucleotide repeats 111 


N-banding, karyotypic analysis 748, 749 
near-isogenic lines (NILs), quantitative 
trait loci mapping 804 
neo gene 335-6 
neomycin phosphotransferase II 335-6 
neonates, cytogenetic analysis 157, 159 
nested deletion 514, 523-4, 526
sequencing projects 524 
Netscape software 833, 847 
network services 813-35 
bulletin boards 813, 814-15 
e-mail 812,814 
file transfer 815-33 
Gopher 816, 833
mail lists 814 
newsgroups 814-15 
Telnet 835 
World Wide Web 817, 833-5
neurofibromatosis type 2 suppressor gene 
442 
newsgroups 814-15 
nick translation 206-9, 219, 250-2 
DNA labelling with biotin and 
digoxigenin 228-30 
NIH/CEPH maps 102 
nitro blue tetrazolium (NBT) 604 
node 810 
non-Hodgkin’s lymphoma 162, 165
NotI gene 675-56 
nucleolar organizer region (NOR) 747, 748
staining 155, 156


ocean 432, 434, 435, 436-7 
oligonucleotide 
base stacking 568 
fingerprinting 423 
hybridization 524 
labelling with NHS- or ITC-haptens 
612-13 
ligation assay (OLA) 106-7 
multiplex sequencing 524 
³²P labelling 524
primers 
annealing 569 
design 108-9 
dispersed repeats 110 
purification of labelled 607-8 
oligonucleotide-alkaline phosphatase 
conjugate preparation 613-14 
oligonucleotide-enzyme conjugate 606-7 
sequence detection 611 
oncogenes 656 
denaturing gradient gel electrophoresis 
500 
open reading frames (ORFs) 
E. coli 725, 733
K-12 737
yeast 700-2, 704, 710
orphan receptor cloning 470 
Oryza sativa 776 
see also rice 


P1 clones 
DNA extraction 414-16 
end probes 377 
long-range mapping 369, 370-1
partial digest mapping 377 
screening by hybridization 377-8
P1 cloning
artificial system 371
recombinant DNA packaging into phage
T4 heads 371
P1 library
Drosophila melanogaster 676, 680, 681-2
Drosophila viridis 682-3
P1 library construction 401-14
insert DNA preparation 402-5
long-range mapping 376-7, 401-14
P1 packaging 410-14
recombinant clone recovery 410-14
recombinant DNA production 407-10
vector DNA preparation 405-7
P1 maxipreps 377, 415-16
P1 minipreps 377, 414-15
P1 vector 377 
system 642 
p53 657
P element, Drosophila melanogaster 673-4,
679, 680-1
P element-mediated germline
transformation in Drosophila 
melanogaster 674 
P-labelled deoxynucleotides 602 
*P-labelled deoxynucleotides 602 
pac gene 336-7 
PAC vector system 642 
pachytene karyotyping 748 
palindromic units 728-9 
papilloma transforming proteins E6 and
E7 660 
papillomavirus 473 
PAR1 23
PAR2 23 
parent-offspring pair sampling 36-7 
paroxysmal nocturnal haemoglobinuria 
470, 476
partial restriction digest mapping
P1 clones 377
YACs 374 
paternity analysis 128 
DNA fingerprinting 131, 132
DNA typing 130, 131
minisatellite loci typing 133
PAX7-FKHR fusion gene 194
pCDM8 vector 474, 475
modifications 474
transient expression system 472, 473
pcDNA1 474, 475
pcDNA3 474, 475
PCR 
amplification 3 
specific alleles (PASA) 107 
asymmetric 548-9, 570 
colony 287-8 
direct sequencing of products 569-71 
with flow sorting 294 
markers in plant genome analysis 786 
product recovery 552-4 
symmetric 550 
PCR-RFLPs 118-19 
PDUAL 523 
pedigree 9 
breast cancer families 29,30 
looped 38 
penetrance 10, 28,31 
probability 36 
Percoll gradient enucleation 354 
perdurance 668 
peripheral blood samples, lymphocyte 
separation 274 
Ph1 gene 751-2 
phage 
clones 217 
filamentous 562 
genome size 888 
lambdoid of E. coli 728 
lambda-phage clone 521 
phagemids 533, 535, 558
dideoxy sequencing 563
DNA preparation 533, 535, 546
phenocopies 10 
phenotype 9-10 
changes in hybridization 338 
mapping 338-9 
not segregating in Mendelian pattern 23 
phenotypic heterogeneity 33-4 
phenotypic variation 3 
phenylketonuria 655 
Philadelphia chromosome 151, 160 
translocation 246 
phleomycin/bleomycin and phleomycin 
binding protein 337 
photography, automated 168 
photonic noise 309 
photosynthetic proteins, rice 778 
phyletic relatedness 746 
phyllodes tumour of breast 192 
phylogenetic relationship 747 
physical maps 319-20 
integration with genetic 85 
phytohaemagglutinin (PHA) 152 
phytohaemagglutinin-stimulated 
peripheral blood lymphocytes 294 
pixel binning 309, 310
pJFE14 expression vector 474-5 
plant genome analysis 746-7 
amphidiploids 750-1 
chromosome pairing in hybrids 749-51 
crossability 747 
denaturing gradient gel electrophoresis 
(DGGE) 786 
diploid hybrids 750 
DNA probes 754 
fluorescence in situ hybridization 754, 
755 
genomic in situ hybridization 754, 755
in situ hybridization 754 
intergenomic affinity 750 
karyotypic analysis 747-9 
karyotypic features 747-8 
map-based technology 754 
microdissection 755 
molecular markers 754 
molecular tools 753-5 
PCR markers 786 
preferential pairing 750-1 
protein electrophoresis 753 
random amplified polymorphic DNA 
(RAPD) 754-5, 786 
reproductive isolation 747 
techniques 746 
triploid hybrids 750 
plasmid 192, 217
E. coli 719 
libraries 244 
sequencing vectors 563 
plasmid DNA 
PEG precipitation 540 
sequencing 562 
short alkaline miniprep 532, 533, 540-1
standard alkaline lysis miniprep 539
plasmid templates 531-4
combined anion-exchange
chromatography/silica gel-based
purification 533, 534
PEG precipitation 532, 540
QIAGEN preparation 532-4
short alkaline miniprep for plasmid
DNA 532, 533, 540-1
silica gel-based purification 533, 534
standard alkaline lysis miniprep 531-2,
533, 539
polo gene 668 
poly(A)+ RNA preparation 479-80
polyacrylamide 587 
polyamines 298, 299-300 
polycythemia rubra vera 166 
polyethylene glycol 326-7 
selective precipitation 551 
polymerase chain reaction see PCR 
polymorphic markers 3 
polymorphism information content (PIC) 
17,21, 100 
polymorphisms 3, 47,99 
A residues 110
PCR amplified 110 
screening 110 
SSCP analysis protocol 115-17 
substitutional 104, 107 
see also DNA polymorphisms; single- 
stranded conformational 
polymorphism (SSCP) analysis 
polyploid 
heterogenomic 746 
species 746 
population substructuring, allele 
frequencies 130 
positional cloning 
long-range mapping 369 
mouse mutation mapping 640 
Prader-Willi syndrome 242
pre-banding using Wright's stain 222 
preferential pairing, plant genome analysis 
750-1 
prenatal diagnosis, cytogenetic analysis 
153-4 
primer 
AGK34 227-8 
end-labelled 565,567 
fluorescent labels 571 
insert-specific 567 
modular 568 
primer walking 514, 520-1, 526, 558
DNA size 520 
primer synthesis 521 
sequence detection 607, 609 
with short oligomers 568 
priming sites, mobile 521-2 
primitive neuroectodermal tumours of 
CNS 196 
probe competition to remove repetitive 
sequences 416-17 
probe hybridization/sequence-tagged site 
422 
probe-clone incidence matrix 427 
probeorder program 429, 431, 433
prokaryotic genome sequences 736-8 
prometaphase chromosomes 171-2 
promoter trapping, Arabidopsis nuclear 
genome 769 
propionic acidemia 560 
proto-oncogenes 656
protocols for genetic maps 50-84 
protoplasts 476 
pseudoautosomal region 23, 100 
pseudogenes 
cDNA enrichment 448 
yeast 701-2 
pSP64CS 560 
pSP65CS 560 
pSPL1 
exon trapping 442-3, 445-7, 449-56 
library transient expression 451-2 
recombinant construction 449-51 
pSPL3 exon trapping vector 447 
pSV2 expression vector 334 
puberty 159 
pUC18/19 polylinker 563 
puromycin/puromycin N-acetyl
transferase 336-7 
PYTHIA program 616-24, 706 


quantitative trait loci mapping 803-4
quantum efficiency 309
question and answer dialog 808
QuickMap 425, 427
quinacrine banding (Q-banding) 155, 156,
157, 180-2


R-banding 152, 155, 158, 182-3
FISH 224-5 
whole chromosome painting probes 
Plate 4 
5’ RACE 464-6 
see also rapid amplification of cDNA
ends (RACE) 
radiation fusion hybrids 
fingerprint 425 
fragment map 425 
long multiple-copy probe 425 
mapping 424 
single-copy probe target 425 
radiation hybrids 108, 324, 342-4
analysis 343-4
applications 323
mapping 319, 325, 342, 344
production 358
radiation mapping 342
radiation-reduced hybrids 342, 343
random amplified polymorphic DNA 
(RAPD) 
bulked segregant analysis 785 
plant genome analysis 754-5, 786 
rice genetic linkage map 802 
rice genome mapping 783-5 
sequence-tagged site determination 785 
random amplified polymorphic DNA 
(RAPD)-PCR 113 
random anchor mapping 433-8 
anchored islands 433, 434, 436, 437 
ocean 432, 434, 436-8 
theoretical predictions 435-8 
undetected overlaps 434, 435, 436-8 
random integration 334 
random probing, long-range physical map 
construction 426 
rapid amplification of cDNA ends (RACE)
448, 462-6
ras oncogene 333, 657
recA mutation 376 
recessive characters, z, lod scores 14 
recessive disease, inbred family 38 
recombinant DNA, packaging into phage 
T4 heads 371
recombinant inbred lines (RILs) 786 
recombinant viral vectors 661 
recombinants 47-8 
recombination fraction 7, 8, 11
breast cancer 31
CRI-MAP 62
linkage analysis 31, 33
linkage heterogeneity 33 
sex differences 12 
standard error 11 
relational database management system 
(RDBMS) 812 
renal cell adenocarcinoma 193 
renal tumours, media for cytogenetics 196 
Rep protein 719 
REPBASE database 616, 619
repeat analysis 616 
encoding 626-8 
parsing 626-8 
PYTHIA program 616-24 
recognition of internal repeats 621-4, 
624-5 
recognition of known repeats 616-20, 
624 
repeat subfamily identification 624 
repeats, global 621 
repetitive elements 217 
repetitive extragenic palindrome (REP) 
sequences 728-9 
repetitive sequences, probe competition to 
remove 416-17 
replicons, E. coli 719 
representational difference analysis (RDA) 
113 
reproductive isolation, plant taxa 747 
Resource End Database (RED) 53 
restriction enzyme-based fingerprinting, 
Caenorhabditis elegans 688, 689 
restriction fragment length 
polymorphisms (RFLPs) 17, 21, 45
Arabidopsis 766
chromosome markers 753, 754
identification 103-4
maps of cereal crops Plate 11, 811
rice 776
genome analysis 811
genome mapping 784, 785
map Plate 11 
typing amplified 118-19 
wheat map Plate 11 
restriction fragment length variant (RFLV) 
635, 636, 638 
retroposons 111 
polyadenylate tract 110 
reverse banding see R-banding 
reverse genetics, E. coli 725 
reverse transcriptase, enzymatic DNA 
sequencing 565 
Rhodamine 215, 220
Rhs elements, E. coli 728
ribosomal RNA 
E. coli 726 
genes of Drosophila melanogaster 672-3 
rice 776 
cDNA 
analysis 776-8, 779-81, 782-3 
callus proteins 778, 779-82 
clones 776,778 
root proteins 778, 779-82 
chromosome number 887 
expressed sequence tags (ESTs) 776 
RFLP 776 
map Plate 11 
synteny 
with other cereals 811 
with wheat Plate 11 
rice genetic linkage map 786-804 
construction 801-2 
DNA markers 803, 804
DNA probes 786-801 
gene tagging for map-based cloning 
802-3 
high-density RFLP 801 
population mapping 786-801 
quantitative trait loci mapping 803-4 
rice genome 
anatomy 810-11 
informatics 809-11 
library screening for genes of other 
cereals 812 
size 776, 804-5, 888 
rice genome analysis 
comprehensive map 809 
data handling 809 
database 809-10 
international federated genome 
databases 810-11 
map-based cloning of target genes 807-9 
RFLP map 811 
YAC library construction 808 
rice genome mapping 
linkage analysis 784-5 
PCR techniques 783-6 
RAPD analysis 783-5 
RFLP markers 784, 785
single-strand conformation 
polymorphism analysis 785-6 
rice physical map 804-9 
bacterial artificial chromosomes (BACs)
805 
chromosome isolation 806 
construction strategies 806-7 
cosmid libraries 805-6 
DNA markers 806, 807 
future directions 808-9 
YAC 805-6 
clones 806-7 
rickets, hypophosphataemic 22 
ring chromosomes 157, 159
RNA isolation 190 
cDNA library construction 478-9 
RNA probes 217 
Rpg1 gene 812


³⁵S-labelled nucleotides 602
sacB gene 371
Saccharomyces cerevisiae 631, 632, 696
chromosome 696 
number 887 
size 888 
clone library 697 
genetic mapping 696-7 
genome 696 
sequencing 513 
size 888 
model system 696 
see also yeast 
Sanger sequencing method 561-9
satellite DNA, Drosophila melanogaster 672
satellited (SAT) chromosomes 748
Sau3A cosmid digest 445, 446 
scanning microscopy, sequencing 572 
Schizosaccharomyces pombe 422 
chromosome number 887 
genome size 888 
random anchor mapping 434-8 
YAC map 428, 430-1 
Secale cereale banding analysis 749 
secreted proteins, transient expression 
screening 471, 477
SEG program 624 
segregation, non-independent 6 
selection genes, endogenous 329-31, 
332-3
Sendai virus 326 
seq, genomic 571 
Sequenase 561, 563, 564, 565, 566
sequence 
characterized amplified regions 
(SCARs) 785 
editing 514 
fidelity 265 
management programs 514 
polymorphism conversion to 
convenient assays 106-7 
similarity in shared repetitive structure 
622 
with six-phase amino acid translation 
514 
subchromosomal mapping 339 
tandemly repeated 217 
transfer 608 
sequence analysis 496,513 
PCR products 571 
PCR-amplified segment 104-6 
programs 514 
sequence detection 600 
charge coupled device 606 
chemical labelling 609 
cost 601 
digoxigenin 611 
DNA transfer 600-2 
to membranes 608
enzymatic labelling 609 
enzyme-linked 605 
chemical end-labelling 607 
enzymatic labelling 607 
methods 600, 601, 602-9 
oligonucleotide labelling 607 
sequencing protocols 609 
fluorescent 609-10 


handling 601-2 
hapten-based 611 
hybridization 600 
hybridization-based 611-12 
methods 600-2 
multiple dye sequence machines 610 
multiplex sequencing 611-12 
oligonucleotide-enzyme conjugates 
606-7, 611, 613-14 
primer walking 607 
project size 601 
radioactive methods 600, 601 
silver staining 600, 610-11 
sequence labelling 600 
alkaline phosphatase 603 
alkaline phosphatase-labelled 
antibodies 603 
biotin 602, 603 
chemiluminescent substrates 604-6 
colorimetric substrates 603-4 
cost 601 
digoxigenin 602, 603 
2,4-dinitrophenyl (DNP) 602 
end-labelled primers 602 
enzyme-linked methods 600 
fluorescein 602 
fluorescent 600 
fluorigenic substrates 606 
horseradish peroxidase 603 
incorporation of labelled nucleotides 
602 
isotopes 602 
label multiplexing 606 
oligonucleotides with NHS- or ITC- 
haptens 612-13 
radioactive 600,602 
streptavidin-phosphatase complex 603 
sequence-tagged site (STS) 422,424 
assay of Caenorhabditis elegans 688 
content mapping 642 
Drosophila melanogaster 678-9, 680-1 
rice genetic linkage map marker 802 
rice genome mapping 785 
YAC contigs 642-3 
sequencing 
automation 571-2, 609 
by hybridization 568-9, 573 
Caenorhabditis elegans 689-90 
chemical 559-61 
consistency 561 
degradation method 558 
DNA modification analysis 560 
enzymatic DNA comparison 565-6 
PCR-amplified products 561 
vectors 560 
computer resources 514 
cycle 566-7 
dideoxy 514, 558, 560, 562-3
direct of PCR products 560, 569-71 
DNA 
polymerase 566 
sequencing machines 626-7 
Drosophila melanogaster genome 682 
E. coli 726 
random clone preparation 731 
end-labelled primers 567 
enzymatic DNA 561-9
5’-end-labelled primers 565 
chemical sequencing comparison
565-6
Klenow fragment 563-4, 565 
labelling/termination 563 
reverse transcriptase 565 
Sanger 563 
Sequenase 564 
Taq DNA polymerase 564-5 
techniques 563-6 

enzymatic method 558
gel electrophoresis 572
high resolution denaturing
polyacrylamide electrophoresis 558
in vivo amplification methods 531-6
large-scale of Drosophila melanogaster 681
mass spectrometry 572-3
membrane development 608-9
multiplex 524-6, 559
nested deletions 523-4
novel techniques 566-9
PCR technology 559
primer walking with short oligomers
568
primer-directed 567-8
rice 776-8
robots 777
scanning microscopy 572
shotgun 518-20, 526
solid phase 559, 570

strategies 514,518 
directed 520 
nested deletion 526 
ordered 518 
primer walking 520-1, 526 
random 518 
shotgun 518-20, 526 
transposon mediated 526 

technology 514 

template 
amplification/ purification 531 
PCR products 558 
preparation 514 

template generation 569-71 
asymmetric PCR 570 
automation 572 
cycle sequencing of PCR 571 
double-stranded 570-1 
lambda exonuclease-generated single-stranded DNA 570
solid-phase 570 

transposon-facilitated 681 

transposon-mediated 521-2, 523 

walking primers 520-1, 526, 558

sequential digestion 523 

severe combined immune deficiency
(SCID) gene therapy 653-5

sex chromosome 
cytogenetic analysis of abnormality 159 
locus inheritance pattern 100 

sex linkage 22-3 

sex reversal 345 

Sézary’s syndrome 165 

shaker-1 gene 641 

short interspersed repeat elements 344 


short tandem repeats 21 
see also microsatellites 
shotgun sequencing 514, 518-20, 526 
assembly 519 
contigs 519 
directed 519-20 
editing 519 
library preparation 518-19 
sequence acquisition 519 
silicon intensified camera 308 
silver staining, sequence detection 610-11 
simple regions 621, 623 
simple sequence length polymorphism 
(SSLP), segregation analysis in 
mouse 635, 636 
simple tandem repeats 
DNA typing 129 
PCR typing 135-6 
single-copy probe hybridization, data 
427-30 
single-stranded conformational 
polymorphism (SSCP) analysis 
103, 104, 105 
denaturing gradient gel electrophoresis 
503 
polymorphism 115-17 
rice genome mapping 785-6 
sensitivity 503 


single-stranded DNA binding protein 521,
568
slab gel electrophoresis 579-85 
apparatus 582 
automated sequencing 585 
autoradiography 584 
band width 582 
bis-acrylamide 580,581 
blotting 584-5 
buffer 581 
capillary blotting 584 
catalyst systems 581 
crosslinkers 580 
degassing 581 
denaturants 581-2 
direct transfer electrophoresis 584-5 
electric field conditions 582-3 
electroblotting 584 
gel composition 580 
gel dimensions 582 
gel matrix 579-81 
gradient gels 583-4 
Joule heat 582 
loading 582 
manual sequencing 579-85 
polyacrylamide chains 580,581 
sample wells 582 
sharkstooth comb 582 
SMPL program 621-4, 626 
algorithms 626-8 
solid tumour cytogenetics 189-90 
CGH 194-5, 196, 206-9 
technique 189 
chromosome 
direct preparation 198-9 
harvesting 193 
preparation 190-3 
rearrangements 189 
culture media 190
direct preparations 191, 198-9 
disaggregation 190-1, 196-8 
fine-needle aspirates 190 
FISH technique 189, 193-5 
harvesting 
by removal of adherent cells 204-5 
in situ 205-6 
Lipofectin-mediated transfection 202-3 
long-term cultures 191-3, 202-3 
media 196 
nick translation labelling of genomic 
DNA 206-9 
passaging cells 203-4
RNA isolation 190 
short-term cultures 191 
from cell suspensions 199-201 
from explants 201-2 
tumour imprint 
preparation 209 
pretreatment prior to hybridization 
209-10 
tumour sample 190 
washing 190-1 


solid tumours 


chromosome rearrangements 982-3 
gene amplifications 984 
karyotypes 189 

unsynchronized cultures 277 


solutions 863-7 
somatic cell hybrids 107-8, 323 


auxotrophic mutants 333-4 
biology 324 
cell culture 325-6 
cell source 327-8 
characterization 344-5 
chromosome segregation 328 
cloned DNA mapping 325 
dominant selectable marker insertion 
into mammalian genome 334-8 
donor cells 328 
donor chromosomes 357 
endogenous selection genes 332-3 
enucleation 
from plastic bullets 351-3 
Percoll gradient 354 
genotype characterization 325 
half-selection 333 
HAT preparation 346-7 
HAT selection 332-3 
interspecific 324,325 
irradiation and fusion gene transfer 
342-4 
long-range mapping 369 
marker analysis 344-5 
microcells 
filtration 354-5 
fusion to whole recipient cells 355-7 
micronucleation optimization 349-50 
MMCT-generated monochromosomal 
334, 335 
morphology 327-8 
parental cell selection 328 
phenotype mapping 324-5 
positional cloning of human disease 
genes 325 
radiation hybrid production 358
recipient cell lines 328 
selection 328, 329-31, 332-4 
types 324 
whole-cell fusion 338-9, 347-9 
SOX9 gene 345 
SP6 promoter element 560 
spheroplasts 476 
spinal muscular atrophy, trinucleotide 
repeats 111 
Standard Query Language (SQL) 812 
startle disease 641 


strain distribution pattern, mouse genome
mapping 637
streptavidin bridge 525 
streptavidin-phosphatase complex, 
sequence labelling 603 
subarray sampling 309 
subchromosomal mapping of DNA 
sequences 339 
subcloning, nested deletion 568
subgenomic libraries 107-8 
substitutional polymorphism 104, 107 
analysis 108 
supF suppressor tRNA gene 522 
surface proteins, transient expression 
screening 471 
susceptibility allele 37 
SV40 473 
early region 192 
promoter 334 
vector for cDNA library construction 
473, 474 
SV40-based plasmids 475 
SWISS-PROT database 736,771 
synchronization technique for bone 
marrow samples 184-5 
Synchronous Multimegabit Data Services 
(SMDS) 811 
synkaryons 324 
synovial sarcoma, media for cytogenetics 
196 
synteny 811-12 
map Plate 11 
System for Integrated Genome Map 
Assembly (SIGMA) 83 


T7 polymerase 521 
sequence analysis of PCR products 571 
t(9;22) 160 
translocation 151, 167 
T-cell receptor α-β heterodimer 471
T-DNA, Arabidopsis nuclear genome 
767-8, 769 
Ta1 element 763
tandem repeat 
sequences 217 
variability 108-9 
Taq DNA polymerase 561
sequence analysis of PCR products 571 
sequencing 564-5, 566 
Taq polymerase 106, 521, 558, 565
Target End Database (TED) 53 
target genes, map-based cloning 807-9 
Tat1 element 763
tat intron 446 
TAT protein 660
Telnet 835 
telomere 
banding 155 
trap cloning 699 
yeast 699, 705-6 
temperature, melting (Tm) 496, 498, 499
temperature gradient gel electrophoresis, 
DNA sequence analysis 496 
template amplification 531 
agarose gel electrophoresis 533, 536-7, 
538 
asymmetric PCR 533, 536, 537, 548, 549
biotin-streptavidin system 537, 551-2
column purification 533, 538
cosmids 533, 535-6, 546-7
detergent extraction for M13 DNA 
542-3 
DNA direct sequencing in LMP agarose 
533, 538, 555 
double-stranded DNA sequencing 
template purification 537, 550 
freeze and squeeze method 533, 538, 
552-3 
in vitro methods 533, 536-8 
M13 templates 533, 534-5 
magnetic bead purification of M13 DNA 
543-4 
PCR product direct sequencing 536 
PCR product recovery 
agarose method 554-5 
column purification 553-4 
freeze and squeeze method 552-3 
PCR-amplified material molecular 
cloning 536 
PEG precipitation 
plasmid DNA 540 
selective 533, 537, 551
phagemids 533, 535, 546
phenol/chloroform extraction 533, 538
plasmid templates 531-4 
pure double-stranded 537-8 
short alkaline miniprep for DNA 540-1 
silica gel-based purification 533, 538,
545
single-stranded DNA sequencing 
template generation 537-8, 548, 
549, 551-2 
standard alkaline lysis miniprep of 
plasmid DNA 539 
standard PEG-phenol method for DNA 
recovery from M13 phage 541 
symmetric PCR 533,536, 537-8, 550 
template DNA of rice 777 
template purification 531 
N,N,N’,N’-tetramethyl-1,2-diaminoethane 
(TEMED) 581 
tetranucleotide repeats 111-12 
Texas red 215 
biotin detection 235-6 
TFASTA program 771 
α-thalassaemia 242
β-thalassaemia 500
thalassaemia, gene therapy 654,655 
thermal cycle sequencing 566-7 
Thinopyrum bessarabicum, chromosome
pairing 751, 752
thymidine 
block synchronization 171-2 
cell synchronization 265-6 
thymidine kinase 332-3 
tissue plasminogen activator 
repeats 619 
sequence 616 
tissue-specific enhancers 655 
tk gene 338 
Tn3 transposon family 521 
Tn5 transposon family 521 
toxicological assay of chemicals 295 
transduction, generalized 724 
transformation, YAC cloning 372 
transient expression screening 470-2 
advantages 471 
cDNA library construction 472-6 
cDNA synthesis 472 
disadvantages 471-2 
extracellular proteins 477 
intracellular proteins 477 
methods 476-7 
procedure 470-1 
screening ligand 472 
secreted proteins 477 
supernatant bioassay 477 
for surface molecules by panning and 
rescue 476, 486-91 
vectors 472-6 
translocation breakpoints, microdissection 
273 
translocations 
balanced 156 
unbalanced 156, 157 
Transmission Control Protocol/Internet 
Protocol 810 
transmission distortion tests 40 
transposable elements of Drosophila 
melanogaster 673 
transposon tagging 
Arabidopsis nuclear genome 768-9 
Drosophila melanogaster 674 
transposon transmission in E. coli 728 
transposon-mediated sequencing 514, 
521-2, 523,526 
mobile priming sites 521-2 
transposon-generated deletions 522, 
523 
TRAP gene mapping 221 
trinucleotide repeats 47, 111-12 
triploid hybrids, plant genome analysis 
750 
trisomies 156-7 
Triticum aestivum, chromosome pairing 
751-2 
Triticum turgidum, C-banding 749
trpB gene 337 
tryptophan/tryptophan synthase 
(β-subunit) 337
Tth111I site 560
tumour antigen expression 658 
tumour cell transplantation into nude mice 
193 
tumour necrosis factor 659 
tumour suppressor genes 656 
denaturing gradient gel electrophoresis
500

tumour-infiltrating lymphocytes (TILs) 
659 

tumourigenesis 159 

twins, zygosity testing 131 

t(X;18)(p11.2;q11.2) 194

Ty elements, yeast 702,706 

tyrosine kinase 641 


undetected overlaps 434, 435, 436-8 
universal DNA amplification 269 
universal relative locator (URL) 833 
UNIX 50-1, 838-41 

activities 838-40
background/foreground processing
839-40
case 838
commands 838
comparison with VMS 840
control-key combinations 841
deleting 838
filenames 839
output control 840
program execution control 839
wildcard characters 840
working in other directories 839
upstream acting sequences (UAS) 703 
upstream repressing sequences (URS) 703 


variable number of tandem repeats 
(VNTR) 108-9 
vectors 
adenoviral 662 
delivery 661 
development 662-3 
DNA preparation for cosmid libraries 
375 
viral genes 192 
viral infection 659-61 
viral vectors 662-3 
virus, genome size 888 
viscotoxin 778 
VMS 840-1 


Waldenström’s macroglobulinaemia 166
Webcrawler 820, 834 
wheat 
chromosome markers 802 
chromosome number 887 
RFLP map Plate 11 
synteny with rice Plate 11 
white gene 668, 669 
whole-cell fusion 338-9 
mammalian cells 347-8 
whole-cell hybrids 324, 338-9 
applications 323 
whole-chromosome libraries 248 
whole-chromosome map construction 342 
whole-chromosome paints 246,247 
wide area networks 810,811 
Wilms’ tumour suppressor 780
World Wide Web 805, 817, 833-5 
databases 847-61 
genome resource centres 847-61 
URLs for live help 841 
Wright's stain 222


X chromosome 22 
centromeric probe Plate 8 
genetic recombination 23 
X-specific part 100 
X-linked agammaglobulinaemia (XLA) 641
X-linked traits 22 
X-PED 53 
X-Phos see 5-bromo-4-chloro-3-indolyl 
phosphate (BCIP) 
X-Y pairing region 100 
Xa-1 gene 807 
xanthine phosphoribosyltransferase 336 
xeroderma pigmentosum 470, 475, 476
XhoI restriction site 270 


Y chromosome, inverted 159 
YAC-to-YAC hybridization 425, 426 
YACs 108, 147 
agarose block preparation 373, 391-3 
anchored framework map construction 
643, 644 
cDNA enrichment 443 
chimaerism 432 
chromosome walking 374 
clones 
end-specific probe generation 395-7 
replication 373 
rice physical map 806-7 
cloning 372 
combination with chromosome paints 
243 
contig 108,422 
assembly 273 
construction 271 
direct cDNA isolation 447 
direct hybridization to cDNA filters 
456-7 
DNA partial restriction digest mapping 
394-5 
expressed sequence isolation 455 
exon-trapping 447 
filters 429 
FISH 218 
Généthon human genome map 425 
inner product mapping 244-5 
insert size 369 
library 
Arabidopsis 765-6 


arraying 372,389 
chimaeric clones 642 
construction 372, 381-8 
Drosophila melanogaster 675-6
filter lifts 373, 389-91 
mouse 642, 643 
rice 808 
screening 373 

long-range mapping 369,370, 381-8 


map for Schizosaccharomyces pombe 428,
430-1
P1 clones 371


partial restriction digest mapping 374 


PCR 374-5 
physical genome maps 641-2 
pooling 373 
preliminary characterization 373-4 
preparation 
by ligation 384-5 
for transformation 386 
probe 
generation 374 
ordering 429 
rice physical map 805 
screening by hybridization 377-8 
size 
determination 373 
fractionation by pulsed-field gel 
electrophoresis 393-4
subcloning 370 
use 372-5 
vector 372 
arm preparation 382-3 
yeast spheroplast preparation 386-7 
yeast 
Arabidopsis genes cloned 767 
ARS elements 705, 709 
base composition 704-5 
chromosome 
breakage 709 
sequence homology 706 
telomeres 699 


chromosome II repeat sequences 706,
707
clusters 
duplicated genes 708 
homology regions 708-9 
duplicated genes 708 
functional analysis 711 
GC content 704-5 
gene
density 704-5 
organization 702-7 
genetic map 707 
genetic redundancy 707-9 
genome 696 
architecture 702-7 
organization 707-11 
genome project 696, 697-700 
chromosome sequencing 697-9 
cloning 697-9 
mapping 697-9 
nested chromosomal fragmentation 
699 
quality control 699-700 
sequence analysis 74 
sequence assembly 699-700 
sequencing strategies 699-700 
strategy 697 
human connection 709-10 
information sources 714 
intergene intervals 704 
introns 701-2 
open reading frames (ORFs) 700-2, 704, 
710 
physical map 707 
proteome 700-1 
pseudogenes 701-2 
putative membrane proteins 702 
putative mitochondrial proteins 702 
repeat sequences 706-7 
sequence variation among strains 709 
spheroplasts 
preparation 386-7 
transformation 388 
telomeres 705-6 
transcriptional unit arrangement 703 
transformation efficiency 697 
Ty elements 702, 706 
upstream acting sequences (UAS) 703 
upstream repressing sequences (URS) 
703 
see also Saccharomyces cerevisiae 
yeast artificial chromosomes see YACs


z, lod scores 13-14, 17 
tables 19 

z,lod scores 14-16, 17 
tables 19-21 

zygosity testing, twins 131 
ICRF Handbook of
Genome Analysis 


The ICRF Handbook of Genome Analysis is a combination of protocol manual 
and informational resource, with expert contributors drawn from a wide range 
of research centres. It describes and evaluates a wide range of techniques, 
providing step-by-step protocols. The two volumes cover both the human 
genome and genomes of other model organisms. 